Remember to:
- If teaching with RStudio Cloud or RStudio Server, pre-load surveys.csv, species.csv, and plots.csv into the student’s working directory.
- Consider removing the dplyr package so you can demonstrate installing it.
  - Linux users: you may not want to do this because the source install is slow.
Introduction to tabular data
- We will be working with data from the Portal Project.
- Long-term experimental study of small mammals in Arizona.
Setup local RStudio
- Download surveys, species, and plots from Datasets into a folder.
- Need to know where the data is: Right click -> Save link as.
- Start/open a project (modeling good practice)
Setup RStudio Cloud
- Go to the class space on RStudio Cloud
- Click on this week’s assignment
Loading and viewing the dataset
- Dataset is composed of three tables.
- Load these into R using read.csv().
surveys <- read.csv("surveys.csv")
species <- read.csv("species.csv")
plots <- read.csv("plots.csv")
- Display data by clicking on it in Environment
- Three tables
  - surveys - main table, one row for each rodent captured, data on date, location, species ID, sex, and size
  - species - Latin species names for each species ID + general taxon
  - plots - information on the experimental manipulations at the site
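Clicking a table in Environment only works inside RStudio. A console-based sketch of the same inspection might look like the following; surveys_demo is a hypothetical stand-in with invented values, not the real surveys.csv.

```r
# Toy stand-in for surveys.csv; names and values are invented for illustration.
surveys_demo <- data.frame(
  record_id = 1:4,
  year = c(1994, 1995, 1996, 1996),
  species_id = c("DS", "DM", "DS", "DO"),
  weight = c(120, 43, NA, 52)
)

head(surveys_demo, 2)          # show the first two rows
str(surveys_demo)              # show column names and types
dim_demo <- dim(surveys_demo)  # number of rows, then number of columns
dim_demo
```

With the real data, the same calls would be made on surveys, species, and plots after loading them with read.csv().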
- Good tabular data structure
- One table per type of data
- Tables can be linked together to combine information.
- Each row contains a single record.
- A single observation or data point
- Each column or field contains a single attribute.
- A single type of information
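As a rough illustration of how tables can be linked, the sketch below combines two small made-up tables on a shared species_id column using base R's merge(). The data frames surveys_demo and species_demo are hypothetical, and merge() is just one possible way to link tables, not necessarily the approach used later in the course.

```r
# One table per type of data: captures in one, species names in another.
# All values are invented for illustration.
surveys_demo <- data.frame(
  record_id = 1:3,
  species_id = c("DS", "DM", "DS"),
  weight = c(120, 43, 117)
)
species_demo <- data.frame(
  species_id = c("DM", "DS"),
  genus = c("Dipodomys", "Dipodomys"),
  species = c("merriami", "spectabilis")
)

# Link the tables through the column they share.
combined <- merge(surveys_demo, species_demo, by = "species_id")
combined
```

Each capture row now carries the species names, without those names having to be repeated in the surveys table itself.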
Packages
- Main way that reusable code is shared in R
- Combination of code, data, and documentation
- R has a rich ecosystem of packages for data manipulation & analysis
- Download and install packages with the R console:
install.packages("dplyr")
- Using a package:
- Load all of the functions in the package:
library("dplyr")
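In scripts that get re-run, it can be convenient to install a package only when it is missing. A minimal sketch, assuming a CRAN mirror is already configured for the session:

```r
# Install dplyr only if it is not already available, then load it.
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
library("dplyr")
```

Installing is needed once per machine; library() is needed in every new R session.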
Basic dplyr
- Modern data manipulation library for R
surveys <- read.csv("surveys.csv")
Select
- Select a subset of columns.
select(surveys, year, month, day)
- The columns can be listed in any order; the output follows the order given.
select(surveys, month, day, year)
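To see the effect of column order without loading the full dataset, here is a small sketch using a hypothetical surveys_demo data frame with invented values:

```r
library(dplyr)

# Toy stand-in for surveys; values are invented for illustration.
surveys_demo <- data.frame(
  record_id = 1:3,
  month = c(7, 7, 8),
  day = c(16, 16, 19),
  year = c(1977, 1977, 1977),
  species_id = c("DS", "DM", "DS")
)

ymd <- select(surveys_demo, year, month, day)
mdy <- select(surveys_demo, month, day, year)

names(ymd)  # "year" "month" "day"
names(mdy)  # "month" "day" "year"
```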
Mutate
- Add new columns with calculated values using mutate()
mutate(surveys, hindfoot_length_cm = hindfoot_length / 10)
- If we look at surveys now, will it contain the new column?
- Open surveys
- All of these commands produce new values, data frames in this case
- To store them for later use we need to assign them to a variable
surveys_plus <- mutate(surveys,
hindfoot_length_cm = hindfoot_length / 10)
- Or we could overwrite the existing variable if we don’t need it
surveys <- mutate(surveys,
hindfoot_length_cm = hindfoot_length / 10)
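The difference between running mutate() and storing its result can be checked directly. This sketch uses a hypothetical surveys_demo data frame with invented hindfoot lengths:

```r
library(dplyr)

# Toy stand-in for surveys; values are invented for illustration.
surveys_demo <- data.frame(
  species_id = c("DS", "DM", "DO"),
  hindfoot_length = c(50, 36, 35)
)

# Without assignment the result is printed and then discarded.
mutate(surveys_demo, hindfoot_length_cm = hindfoot_length / 10)
"hindfoot_length_cm" %in% names(surveys_demo)  # FALSE

# With assignment the new data frame is kept under a new name.
surveys_plus <- mutate(surveys_demo,
                       hindfoot_length_cm = hindfoot_length / 10)
"hindfoot_length_cm" %in% names(surveys_plus)  # TRUE
```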
Arrange
- We can sort the data in the table using arrange
- To sort the surveys table by weight
arrange(surveys, weight)
- We can reverse the order of the sort by “wrapping” weight in another function, desc, for “descending”
arrange(surveys, desc(weight))
- We can also sort by multiple columns, so if we wanted to sort first by plot and then by date
arrange(surveys, plot, year, month, day)
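The three kinds of sort can be compared on a small example. The sketch below uses a hypothetical surveys_demo data frame with invented values; note that arrange() places missing values last, including when desc() is used.

```r
library(dplyr)

# Toy stand-in for surveys; values are invented for illustration.
surveys_demo <- data.frame(
  plot = c(2, 1, 2, 1),
  year = c(1996, 1995, 1994, 1997),
  weight = c(40, 120, NA, 52)
)

by_weight <- arrange(surveys_demo, weight)
by_weight$weight        # 40 52 120 NA

by_weight_desc <- arrange(surveys_demo, desc(weight))
by_weight_desc$weight   # 120 52 40 NA

# Sort by plot first, then by year within each plot.
by_plot_year <- arrange(surveys_demo, plot, year)
by_plot_year$year       # 1995 1997 1994 1996
```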
Filter
- Use filter() to get only the rows that meet certain criteria.
- Combine the data frame to be filtered with a series of conditional statements.
- Column, condition, value
- To filter the data frame to only keep the data on species DS
  - Type the name of the function, filter
  - Parentheses
  - The name of the data frame we want to filter, surveys
  - The column we want to filter on, species_id
  - The condition, which is == for “is equal to”
  - And then the value, "DS"
  - DS here is a string, not a variable or a column name, so we enclose it in quotation marks
filter(surveys, species_id == "DS")
- Like with vectors we can have a condition that is “not equal to” using “!=”
- So if we wanted the data for all species except “DS”
filter(surveys, species_id != "DS")
- We can also filter on multiple conditions at once
- In computing we combine conditions in two ways: “and” & “or”
- “and” means that all of the conditions must be true
- Do this in dplyr using additional comma-separated arguments
- So, to get the data on species “DS” for years after 1995:
filter(surveys, species_id == "DS", year > 1995)
- Alternatively we can use the & symbol, which stands for “and”
filter(surveys, species_id == "DS" & year > 1995)
- This approach is mostly useful for building more complex conditions
- “or” means that one or more of the conditions must be true
- Do this using |
- To get data on all of the Dipodomys species
filter(surveys, species_id == "DS" | species_id == "DM" | species_id == "DO")
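The filters above can be tried on a small example to confirm how “not equal”, “and”, and “or” behave. This sketch uses a hypothetical surveys_demo data frame with invented values:

```r
library(dplyr)

# Toy stand-in for surveys; values are invented for illustration.
surveys_demo <- data.frame(
  species_id = c("DS", "DS", "DM", "DO", "PP"),
  year = c(1994, 1996, 1996, 1997, 1996)
)

# "not equal to": everything except DS
not_ds <- filter(surveys_demo, species_id != "DS")
nrow(not_ds)      # 3

# "and": comma-separated conditions and & keep the same rows
ds_comma <- filter(surveys_demo, species_id == "DS", year > 1995)
ds_amp   <- filter(surveys_demo, species_id == "DS" & year > 1995)
nrow(ds_comma)    # 1
nrow(ds_amp)      # 1

# "or": a row is kept if any one of the conditions is true
dipodomys <- filter(surveys_demo,
                    species_id == "DS" | species_id == "DM" | species_id == "DO")
nrow(dipodomys)   # 4
```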
Filtering null values
- One of the common tasks we use filter for is removing null values from data
- Based on what we learned before it’s natural to think that we do this by using the condition weight != NA
filter(surveys, weight != NA)
- Why didn’t that work?
- Null values like NA are special
- We don’t want to accidentally say that two “missing” things are the same
  - We don’t know if they are
- So use special commands
- is.na() checks if the value is NA
- So if we wanted all of the data where the weight is NA
filter(surveys, is.na(weight))
- We’ll learn more about why this works in the same way as the other conditional statements when we study conditionals in detail later in the course
- To remove null values we combine this with ! for “not”
filter(surveys, !is.na(weight))
- So !is.na(weight) is conceptually the same as “weight != NA”
- It is common to combine a null filter with other conditions using “and”
- For example we might want all of the data on a species that contains weights
filter(surveys, species_id == "DS", !is.na(weight))
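The behavior of NA in comparisons is easy to check on a small example. This sketch uses a hypothetical surveys_demo data frame with invented values; it relies on the fact that filter() keeps only rows where the condition is TRUE, so rows where it evaluates to NA are dropped.

```r
library(dplyr)

# Toy stand-in for surveys; values are invented for illustration.
surveys_demo <- data.frame(
  species_id = c("DS", "DS", "DM"),
  weight = c(120, NA, 43)
)

NA == NA   # NA, not TRUE: we cannot know if two missing values match

# Every comparison with NA yields NA, so no rows are kept.
broken <- filter(surveys_demo, weight != NA)
nrow(broken)           # 0

missing_weight <- filter(surveys_demo, is.na(weight))
nrow(missing_weight)   # 1

has_weight <- filter(surveys_demo, !is.na(weight))
nrow(has_weight)       # 2

ds_with_weight <- filter(surveys_demo, species_id == "DS", !is.na(weight))
nrow(ds_with_weight)   # 1
```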