Learning Objectives
Following this assignment students should be able to:
- connect to a remote database and execute simple queries
- integrate database and R workflow
- export output data from R to database
- tidy data table with redundant fields or overfilled cells
Reading
-
Topics
- SQL
- Using databases from inside of R
-
Readings
Lecture Notes
Exercises
Simple WHERE (10 pts)
A population biologist (Dr. Undomiel) who studies the population dynamics of Dipodomys spectabilis would like to use some data from the Portal Project, but she doesn’t know how to work with large datasets.
Write a query that returns the
Expected outputs for Simple WHERE: 1month
,day
,year
, andweight
of each individual of Dipodomys spectabilis (DS
in thespecies_id
column). Do not include thespecies_id
column in the output. Save this query as a view with the namespectabilis_population_data
.COUNT (10 pts)
Write a query that returns the number of individuals identified to species in each year (i.e., count the
Expected outputs for COUNT: 1species_id
column). Name the count columntotal_abundance
and sort it chronologically. Include the year in the output. Save it astotal_abundance_by_year
. There should only be one value for each year since this is a count of the individuals across all species in that year.Basic Join (20 pts)
Write a query that returns the
Expected outputs for Basic Join: 1year
,month
, andday
for each individual captured as well as it’sgenus
andspecies
names. This can be accomplished by joining thespecies
table to thesurveys
table using thespecies_id
column in both tables. Save this query asspecies_captures_by_date
.Multi-table Join (20 pts)
The
plots
table in the Portal database can be joined to thesurveys
table by joiningplot_id
toplot_id
and thespecies
table can be joined to thesurveys
table by joiningspecies_id
tospecies_id
.The Portal mammal data include data from a number of different experimental manipulations. You want to do a time-series analysis of the population dynamics of all of the species at the site, taking into account the different experimental manipulations. Write a query that returns the
Expected outputs for Multi-table Join: 1year
,month
,day
,genus
andspecies
of every individual as well as theplot_id
andplot_type
of the plot they are captured on. Save this query asspecies_plot_data
.Link to Databases (20 pts)
Let’s access an SQL database directly from R using
dplyr
.Either use an existing copy of the
portal_mammals.sqlite
database or download a new copy. You should be able to link to thesurveys
table in the database using:library(DBI) portaldb <- dbConnect(RSQLite::SQLite(), "portal_mammals.sqlite") surveys <- tbl(portaldb, "surveys")
surveys
is actually a connection to the database, which means that the table remains external to the R environment. Also, we won’t need to worry about it printing out huge numbers of rows when we look at it.- Select the
year
,month
,day
, andspecies_id
columns in that order. - Create a new data frame with the
year
,species_id
, and weight in kilograms of each individual, with no null weights. - Use the
distinct()
function to print thespecies_id
for each species in the dataset that has been weighed.
- Select the
Copy to Database (20 pts)
Dr. Undómiel has decided to focus on the change in size of a few target rodent species over the course of the experiment(1977-2002). She has chosen Dipodymys spectabilis, Onychomys torridus, Perymiscus erimicus, Chaetodipus penicillatus.
Write a script that uses
dplyr
to:- Connect to the
portal_mammals.sqlite
. - Generate a data frame with
year
,species_id
, and the average weight per year (avg_weight
) for each target species. You may find the%in% c()
construction useful for yourfilter()
. - Use
copy_to()
to include your new data frame inportal_mammals.sqlite
. Call it something informative so that Dr. Undómiel can find it easily. Make sure it remains after the connection is terminated usingtemporary = FALSE
.
- Connect to the