Remember the basic rules of tidy data structure

  1. One column per type of information
  2. One row per observation
  3. One value in each cell
  • Unfortunately lots of existing data doesn’t follow these rules
  • Need to convert them to this tidy structure for analysis
  • Use a package called tidyr
install.packages("tidyr")
library(tidyr)
library(dplyr)

Pivot data from wide to long

  • One common issue is data spread over multiple columns that should be in one

Copy link to Western Ghats tree data from datasets page

raw_data = read.csv("http://datacarpentry.org/semester-biology/data/Macroplot_data_Rev.txt", sep = "\t")

View data

  • Data on tree girth from the Western Ghats
  • When a tree had multiple stems the diameter of each stem was entered in a separate column
  • What would a better structure be?

Lead discussion to correct structure

  • To convert the raw data into clean data we’ll use a data cleaning pipeline like the data manipulation pipelines we’ve developed previously
  • Let’s start by adding a treeid column to our data frame using the mutate function from dplyr
  • We want one treeid for each row because there is one tree for each row
clean_data <- raw_data %>%
  mutate(treeid = 1:nrow(raw_data))
  • To get the data in this form we can use pivot_longer
    • Removes redundant columns
    • Arguments:
      • data.frame
      • Columns to include (or not include)
      • names_to: the name of the new column to put the column names in
      • values_to: the name of the new column to put the column values in
clean_data <- raw_data %>%
  mutate(treeid = 1:nrow(raw_data)) %>%
  pivot_longer(TreeGirth1:TreeGirth5, names_to = "stem", values_to = "girth")
  • The colon specifies all columns starting at TreeGirth1 and ending at TreeGirth5

View data

  • Still has zeros for where there were no stems, so filter these out
clean_data <- raw_data %>%
  mutate(treeid = 1:nrow(raw_data)) %>%
  pivot_longer(TreeGirth1:TreeGirth5, names_to = "stem", values_to = "girth") %>%
  filter(girth != 0)

Extract

  • Want stem column to contain numbers 1-5 not TreeGirth1
  • extract()
    • Extracts one or more values from a column
    • Uses regular expressions
    • Arguments:
      • data.frame
      • Column name
      • Names of the new columns
      • Regular expression
clean_data <- raw_data %>%
  mutate(treeid = 1:nrow(raw_data)) %>%
  pivot_longer(TreeGirth1:TreeGirth5, names_to = "stem", values_to = "girth") %>%
  filter(girth != 0) %>%
  extract(stem, 'stem', 'TreeGirth(.)')
  • TreeGirth. means the word “TreeGirth” followed by a single value
  • The () indicate what part of this to extract, so just the number at the end

  • This gives us the result we want, with just the stem number in the stem column
  • But you may notice that this number is on the left side of the column, not the right
  • That’s because the number is still stored as a character, because it was extracted from a string
  • To convert it to it’s actual type we can add the optional argument convert = TRUE to `extract
clean_data <- raw_data %>%
  mutate(treeid = 1:nrow(raw_data)) %>%
  pivot_longer(TreeGirth1:TreeGirth5, names_to = "stem", values_to = "girth") %>%
  filter(girth != 0) %>%
  extract(stem, 'stem', 'TreeGirth(.)', convert = TRUE)
  • This attempts to convert the values from characters to their actual type
  • This is a good addition when extracting numbers because then you can work with the column as numbers

Separate

  • Genus and species information are combined in a single column
  • separate()
    • Separates multiple values in single column
    • Arguments:
      • data.frame
      • Column name
      • New column names
      • Separator value, character, or position
clean_data <- raw_data %>%
  pivot_longer(TreeGirth1:TreeGirth5, names_to = "stem", values_to = "girth") %>%
  filter(girth != 0) %>%
  extract(stem, 'stem', 'TreeGirth(.)') %>%
  separate(SpCode, c('genus', 'species'), 4)

Unite and Pivot Wider

  • Sometimes we need to go in the other direction
  • Count the number of stems of each species on each plot
stem_counts <- clean_data %>% 
  group_by(PlotID, genus, species) %>% 
  summarize(count = n())
  • Software for running analysis requires cross-tab (or wide) data
  • Site in rows, species in columns, counts in cells
  • First need a single species ID
  • unite
    • Combine values from multiple columns into one
    • Arguments:
      • data.frame
      • New column name
      • Columns to combine
stem_counts_wide <- stem_counts %>% 
  unite(species_id, genus, species)
  • Then make the data wide
  • pivot_wider
    • Spread values across multiple columns
    • Arguments:
      • data.frame
      • Name of column to use for wide columns
      • Name of column containing the values for the cells
      • Optional fill argument for what to put in empty cells
stem_counts_wide <- stem_counts %>% 
  unite(species_id, genus, species) %>%
  pivot_wider(names_from = species_id, values_from = count)
  • This leaves null values when there is no value in the starting table
  • But we can replace this with something else using values_fill
stem_counts_wide <- stem_counts %>% 
  unite(species_id, genus, species) %>%
  pivot_wider(names_from = species_id,
              values_from = count,
              values_fill = list(count = 0))

Completing data with gaps

  • Some write out a value once and then leave the following rows blank
gappy_data <- read.csv("http://www.datacarpentry.org/semester-biology/data/gappy-data.csv")
gappy_data
  • This works well for humans, but not for computers
  • Can fill in these gaps using fill
clean_data <- gappy_data %>%
  fill(Species)
  • Fills down by default, but other directions are possible

  • Often data only includes observed values, but we need to list other values
  • Missing zeros or NA’s
clean_data <- gappy_data %>%
  fill(Species) %>%
  complete(Species, Individual)
  • Could also use this to add zeros to our long stem_counts data frame
stem_counts <- clean_data %>% 
  group_by(PlotID, genus, species) %>% 
  summarize(count = n()) %>%
  complete(PlotID, nesting(genus, species), fill = list(count = 0))