Working with Tabular Data

Remember to:

If teaching with RStudio Cloud or RStudio Server pre-load surveys.csv, species.csv, and plots.csv into the student’s working directory

Consider removing the dplyr package so you can demonstrate installing it.

Linux users: you may not want to do this because the source install is slow

Introduction to tabular data

We will be working with data from the Portal Project.
Long-term experimental study of small mammals in Arizona.

Setup local RStudio

Download surveys, species, and plots from Datasets into folder.
Need to know where the data is: Right click -> Save link as.
Start/open a project (modeling good practice)

Setup RStudio Clould

Go to the class space on RStudio Cloud
Click on this weeks assignment

Loading and viewing the dataset

Dataset is composed of three tables.
Load these into R using read.csv().

surveys <- read.csv("surveys.csv")
species <- read.csv("species.csv")
plots <- read.csv("plots.csv")

Display data by clicking on it in Environment
Three tables
- surveys - main table, one row for each rodent captured, date on date, location, species ID, sex, and size
- species - latin species names for each species ID + general taxon
- plots - information on the experimental manipulations at the site
Good tabular data structure
- One table per type of data
  - Tables can be linked together to combine information.
- Each row contains a single record.
  - A single observation or data point
- Each column or field contains a single attribute.
  - A single type of information

Packages

Main way that reusable code is shared in R
Combination of code, data, and documentation
R has a rich ecosystem of packages for data manipulation & analysis
Download and install packages with the R console:
- install.packages("dplyr")
Using a package:
- Load all of the functions in the package: library("dplyr")

Basic `dplyr`

Modern data manipulation library for R

surveys <- read.csv("surveys.csv")

Select

Select a subset of columns.

select(surveys, year, month, day)

They can occur in any order.

select(surveys, month, day, year)

Do Shrub Volume Data Basics 1-2.

Mutate

Add new columns with calculated values using mutate()

mutate(surveys, hindfoot_length_cm = hindfoot_length / 10)

If we look at surveys now will it contain the new column?
Open surveys
All of these commands produce new values, data frames in this case
To store them for later use we need to assign them to a variable

surveys_plus <- mutate(surveys,
                       hindfoot_length_cm = hindfoot_length / 10)

Or we could overwrite the existing variable if we don’t need it

surveys <- mutate(surveys,
                  hindfoot_length_cm = hindfoot_length / 10)

Do Shrub Volume Data Basics 3.

Arrange

We can sort the data in the table using arrange
To sort the surveys table by by weight

arrange(surveys, weight)

We can reverse the order of the sort by “wrapping” weight in another function, desc for “descending

arrange(surveys, desc(weight))

We can also sort by multiple columns, so if we wanted to sort first by plot and then by date

arrange(surveys, plot, year, month, day)

Do Shrub Volume Data Basics 4.

Filter

Use filter() to get only the rows that meet certain criteria.
Combine the data frame to be filtered with a series of conditional statements.
Column, condition, value
To filter the data frame to only keep the data on species DS
- Type the name of the function, filter
- Parentheses
- The name of the data frame we want to filter, surveys
- The column the want to filter on, species_id
- The condition, which is == for “is equal to”
- And then the value, "DS"
- DS here is a string, not a variable or a column name, so we enclose it in quotation marks

filter(surveys, species_id == "DS")

Like with vectors we can have a condition that is “not equal to” using “!=”
So if we wanted the data for all species except “DS

filter(surveys, species_id != "DS")

We can also filter on multiple conditions at once
In computing we combine conditions in two ways “and” & “or”
“and” means that all of the conditions must be true
Do this in dplyr using additional comma separate arguments
So, to get the data on species “DS” for the year 1995:

filter(surveys, species_id == "DS", year > 1995)

Alternatively we can use the & symbol, which stands for “and”

filter(surveys, species_id == "DS" & year > 1995)

This approach is mostly useful for building more complex conditions
“or” means that one or more of the conditions must be true
Do this using |
To get data on all of the Dipodomys species

filter(surveys, species_id == "DS" | species_id == "DM" | species_id == "DO")

Do Shrub Volume Data Basics.

Filtering null values

One of the common tasks we use filter for is removing null values from data
Based on what we learned before it’s natural to think that we do this by using the condition weight != NA

filter(surveys, weight != NA)

Why didn’t that work?
Null values like NA are special
We don’t want to accidentally say that two “missing” things are the same
- We don’t know if they are
So use special commands
is.na() checks if the value is NA
So if we wanted all of the data where the weigh is NA

filter(surveys, is.na(weight))

We’ll learn more about why this works in the same way as the other conditional statements when we study conditionals in detail later in the course
To remove null values we combine this with ! for “not”

filter(surveys, !is.na(weight))

So !is.na(weight) is conceptually the same as “weight != NA”
It is common to combine a null filter with other conditions using “and”
For example we might want all of the data on a species that contains weights

filter(surveys, species_id == "DS", !is.na(weight))

Do Portal Data Manipulation 4-6.

Notes

Introduction to tabular data

Setup local RStudio

Setup RStudio Clould

Loading and viewing the dataset

Packages

Basic dplyr

Select

Mutate

Arrange

Filter

Filtering null values

Basic `dplyr`