Remember the basic rules of tidy data structure
- One column per type of information
- One row per observation
- One value in each cell
- Unfortunately lots of existing data doesn’t follow these rules
- Need to convert them to this tidy structure for analysis
- Use a package called
tidyr
install.packages("tidyr")
library(tidyr)
library(dplyr)
Pivot data from wide to long
- One common issue is data spread over multiple columns that should be in one
Copy link to Western Ghats tree data from datasets page
raw_data = read.csv("http://datacarpentry.org/semester-biology/data/Macroplot_data_Rev.txt", sep = "\t")
View data
- Data on tree girth from the Western Ghats
- When a tree had multiple stems the diameter of each stem was entered in a separate column
- What would a better structure be?
Lead discussion to correct structure
- To convert the raw data into clean data we’ll use a data cleaning pipeline like the data manipulation pipelines we’ve developed previously
- Let’s start by adding a
treeid
column to our data frame using themutate
function fromdplyr
- We want one
treeid
for each row because there is one tree for each row
clean_data <- raw_data %>%
mutate(treeid = 1:nrow(raw_data))
- To get the data in this form we can use
pivot_longer
- Removes redundant columns
- Arguments:
data.frame
- Columns to include (or not include)
names_to
: the name of the new column to put the column names invalues_to
: the name of the new column to put the column values in
clean_data <- raw_data %>%
mutate(treeid = 1:nrow(raw_data)) %>%
pivot_longer(TreeGirth1:TreeGirth5, names_to = "stem", values_to = "girth")
- The colon specifies all columns starting at
TreeGirth1
and ending atTreeGirth5
View data
- Still has zeros for where there were no stems, so filter these out
clean_data <- raw_data %>%
mutate(treeid = 1:nrow(raw_data)) %>%
pivot_longer(TreeGirth1:TreeGirth5, names_to = "stem", values_to = "girth") %>%
filter(girth != 0)
Extract
- Want
stem
column to contain numbers 1-5 notTreeGirth1
extract()
- Extracts one or more values from a column
- Uses regular expressions
- Arguments:
data.frame
- Column name
- Names of the new columns
- Regular expression
clean_data <- raw_data %>%
mutate(treeid = 1:nrow(raw_data)) %>%
pivot_longer(TreeGirth1:TreeGirth5, names_to = "stem", values_to = "girth") %>%
filter(girth != 0) %>%
extract(stem, 'stem', 'TreeGirth(.)')
TreeGirth.
means the word “TreeGirth” followed by a single value-
The
()
indicate what part of this to extract, so just the number at the end - This gives us the result we want, with just the stem number in the
stem
column - But you may notice that this number is on the left side of the column, not the right
- That’s because the number is still stored as a character, because it was extracted from a string
- To convert it to it’s actual type we can add the optional argument
convert = TRUE
to `extract
clean_data <- raw_data %>%
mutate(treeid = 1:nrow(raw_data)) %>%
pivot_longer(TreeGirth1:TreeGirth5, names_to = "stem", values_to = "girth") %>%
filter(girth != 0) %>%
extract(stem, 'stem', 'TreeGirth(.)', convert = TRUE)
- This attempts to convert the values from characters to their actual type
- This is a good addition when extracting numbers because then you can work with the column as numbers
Separate
- Genus and species information are combined in a single column
separate()
- Separates multiple values in single column
- Arguments:
data.frame
- Column name
- New column names
- Separator value, character, or position
clean_data <- raw_data %>%
pivot_longer(TreeGirth1:TreeGirth5, names_to = "stem", values_to = "girth") %>%
filter(girth != 0) %>%
extract(stem, 'stem', 'TreeGirth(.)') %>%
separate(SpCode, c('genus', 'species'), 4)
Unite and Pivot Wider
- Sometimes we need to go in the other direction
- Count the number of stems of each species on each plot
stem_counts <- clean_data %>%
group_by(PlotID, genus, species) %>%
summarize(count = n())
- Software for running analysis requires cross-tab (or wide) data
- Site in rows, species in columns, counts in cells
- First need a single species ID
unite
- Combine values from multiple columns into one
- Arguments:
data.frame
- New column name
- Columns to combine
stem_counts_wide <- stem_counts %>%
unite(species_id, genus, species)
- Then make the data wide
pivot_wider
- Spread values across multiple columns
- Arguments:
data.frame
- Name of column to use for wide columns
- Name of column containing the values for the cells
- Optional
fill
argument for what to put in empty cells
stem_counts_wide <- stem_counts %>%
unite(species_id, genus, species) %>%
pivot_wider(names_from = species_id, values_from = count)
- This leaves null values when there is no value in the starting table
- But we can replace this with something else using
values_fill
stem_counts_wide <- stem_counts %>%
unite(species_id, genus, species) %>%
pivot_wider(names_from = species_id,
values_from = count,
values_fill = list(count = 0))
Completing data with gaps
- Some write out a value once and then leave the following rows blank
gappy_data <- read.csv("http://www.datacarpentry.org/semester-biology/data/gappy-data.csv")
gappy_data
- This works well for humans, but not for computers
- Can fill in these gaps using
fill
clean_data <- gappy_data %>%
fill(Species)
-
Fills down by default, but other directions are possible
- Often data only includes observed values, but we need to list other values
- Missing zeros or
NA
’s
clean_data <- gappy_data %>%
fill(Species) %>%
complete(Species, Individual)
- Could also use this to add zeros to our long
stem_counts
data frame
stem_counts <- clean_data %>%
group_by(PlotID, genus, species) %>%
summarize(count = n()) %>%
complete(PlotID, nesting(genus, species), fill = list(count = 0))