Remember to
- Install
DBI
,RSQLite
, anddbplyr
- Download
portal_mammals.sqlite
.
Why Databases
- Good place to store data
- Lots of existing data is stored in them (which you might want to use)
- Support real-time collaboration on adding and updating data
- Handle out of memory computation
Database Queries
- Data is stored in tables - equivalent of data frames
- Queries store questions about the data - equivalent of dplyr pipelines
- These queries are written in “Structured Query Language” or SQL for short
- Some folks also pronounce this “sequel”
Lots of SQL Databases
- MySQL
- PostgreSQL
-
MS Access
- We’ll use SQLite
- Simple - each database is just a file
- Works basically anywhere with no setup
- Great for relatively small (million row) projects that don’t require simultaneous data entry by multiple folks
- Recommend PostgreSQL if you need something more powerful
Using RStudio with SQLite
- We can RStudio to connect to SQL databases and run queries in them
- We’ll start by doing this directly to learn SQL
-
Then we’ll learn how to work with databases inside our R scripts
- First let’s install some packages that RStudio needs
install.packages(c('DBI', 'RSQLite'))
- Then download the SQLite database we’re going to work with
download.file("https://ndownloader.figshare.com/files/11188550",
"portal.sqlite",
mode = "wb")
- The
mode = "wb"
makes sure that this binary file downloads properly on all operating systems - Finally connect to the database so that we can work with it
New File
->SQL Script
- The special comment on Line 1 is RStudio’s way of connecting to the database
- Add
, dbname = "portal.sqlite"
to get
-- !preview conn=DBI::dbConnect(RSQLite::SQLite(), dbname = "portal.sqlite")
- Save file
- You should see the number 1 as output
Selecting columns
- Choose which columns using
SELECT
- If we want to get all of the columns we can use
*
, which is a wildcard that means “all”.
SELECT *
FROM surveys;
- To select specific columns we list them by name
- Just like
dplyr
, but with different formatting
SELECT year, month, day
FROM surveys;
- We can also do calculations and use functions
- Just like in
dplyr
SELECT year, month, day, species_id, hindfoot_length/10
FROM surveys;
SELECT year, month, day, species_id, ROUND(hindfoot_length/10)
FROM surveys;
Filtering
- The equivalent of
filter
in SQL isWHERE
- Follow
WHERE
with a conditional statement - Unlike in R
=
not==
for equality
- Follow
SELECT year, month, day, species_id, ROUND(hindfoot_length/10)
FROM surveys
WHERE species_id = 'DS';
- To combine two or more conditions use
AND
andOR
.
SELECT year, month, day, species_id, ROUND(hindfoot_length/10)
FROM surveys
WHERE species_id = 'DS' AND year > 1990;
- The red cells are NULL values, in this case instances were no hind foot
measure was taken. We can use
WHERE
to remove them by asking SQL to only give us non-NULL values.
SELECT year, month, day, species_id, ROUND(hindfoot_length/10)
FROM surveys
WHERE species_id = 'DS' AND year > 1990 AND hindfoot_length IS NOT NULL;
Style
- SQL generally doesn’t care about capitalization or line breaks. So it will run a query like this.
- To make the code readable follow style rules for writing SQL code
- Capitalize SQL commands
- Lowercase variable names
- One clause/line
Do the Simple WHERE exercise.
Sorting
- Use
ORDER BY
to sort data. - Equivalent of
arrange
indplyr
isORDER BY
.
SELECT genus, species
FROM species
ORDER BY genus;
- Use
DESC
to sort in descending order.
SELECT genus, species
FROM species
ORDER BY genus DESC;
- Use a list to sort by multiple columns.
SELECT genus, species
FROM species
ORDER BY taxa, genus, species;
Aggregation
- Like in
dplyr
we useGROUP BY
to calculate values for different groups.
SELECT species_id, AVG(weight), COUNT(species_id)
FROM surveys
GROUP BY species_id;
- We can group by multiple columns as well.
SELECT species_id, plot_id, AVG(weight), COUNT(species_id) FROM surveys GROUP BY species_id, plot_id;
- Aggregation functions remove null values from the calculations.
COUNT(species_id)
counts the number of individuals identified to species- To count the number of individuals with weights
SELECT species_id, plot_id, AVG(weight), COUNT(weight) FROM surveys GROUP BY species_id, plot_id;
- Using
*
counts any row with at least one non-null value - We can name aggregated columns using
as
SELECT species_id, plot_id, AVG(weight) as avg_weight, COUNT(weight) as num_indiv FROM surveys GROUP BY species_id, plot_id;
Do the COUNT exercise.
Order matters
Nemonic from: https://twitter.com/statsnam/status/1149431249511075840
Select - So
From - Few
Where - Workers
Group by - Go
Having - Home
Order by - On time
Basic join
- Add genus and species data to surveys table
inner_join
isJOIN
orINNER JOIN
USING
specifies the columns to join on if the tables share column names (likeby
indplyr
)
SELECT *
FROM surveys
JOIN species USING (species_id)
-
If the column names don’t match you can use
ON
instead ofUSING
-
Unlike in
dplyr
you must specify the columns to join on (or things go badly)
SELECT *
FROM surveys
JOIN species
- This just combines every row in
surveys
with every row inspecies
, which isn’t what we want
Multi-table join
- To join multiple tables do multiple joins
SELECT year, month, day, taxa, plot_type
FROM surveys
JOIN species USING (species_id)
JOIN plots USING (plot_id)
Do Basic Join.
Do Multi-table Join.