Notes B

Wrangling data with dplyr: filter, select, arrange

Author

EMW

Published

June 4, 2024

Importing Data

In this class, we are going to be working with the possum dataset which consists of nine morphometric measurements on each of 104 mountain brushtail possums, trapped at seven Australian sites from Southern Victoria to central Queensland.

Source: Lindenmayer, D. B., Viggers, K. L., Cunningham, R. B., and Donnelly, C. F. 1995. Morphological variation among columns of the mountain brushtail possum, Trichosurus caninus Ogilby (Phalangeridae: Marsupiala). Australian Journal of Zoology 43: 449-458.

What is a .csv file?

How do we import it into R?

#possum.csv needs to be saved in the same directory as this file. 
#possum <- read.csv("possum.csv")

Alternatively, you can download it directly from the DAAD package.

#You need to install the DAAG package first. Then you can call it using the code below. 
library(DAAG)
data(possum)

Let’s take a look at this data for a minute to see what information has been recorded.

This data frame contains the following columns:

case observation number site one of seven locations where possums were trapped. The sites were, in order,Cambarville, Bellbird, Whian Whian, Byrangery, Conondale, Allyn River and Bulburin Pop a factor which classifies the sites as Vic Victoria, other New South Wales or Queensland sex a factor with levels f female, m male age age hdlngth head length, in cm skullw skull width, in cm totlngth total length, in cm taill tail length, in cm footlgth foot length, in cm earconch ear conch length, in cm eye distance from medial canthus to lateral canthus of right eye, in cm chest chest girth, in cm belly belly girth, in cm

Installing and Using Packages

Sometimes everything we need (data, functions, etc) are not available in base R. In R, expert users will package up useful things like data and functions into packages that be download and used.

First, you need to download the package from the right hand menu –> You only need to do this once.

In each new .Rmd document, you need to call any packages you want to use but adding the code library(packagename) inside an R chunk.

For example, in this class we will use the tidyverse package a lot.

library(tidyverse)

dplyr

There are actually many commonly used packages wrapped up inside one tidyverse package.

Today we are specifically going to be talking about the package dplyr which is useful to manipulating data sets.

filter

We can use the filter function to extract rows from the data that have a particular characteristic.

For example, we may be interested in only looking at female possums.

#start with the possums dataset, the pipe "%" means apply the action on the following line to the previous line. In this case, pick out only the rows were the sex variable is "f" 

 # %>% == and then do this

possum %>%
  filter(sex == "f")
      case site   Pop sex age hdlngth skullw totlngth taill footlgth earconch
C5       2    1   Vic   f   6    92.5   57.6     91.5  36.5     72.5     51.2
C10      3    1   Vic   f   6    94.0   60.0     95.5  39.0     75.4     51.9
C15      4    1   Vic   f   6    93.2   57.1     92.0  38.0     76.1     52.2
C23      5    1   Vic   f   2    91.5   56.3     85.5  36.0     71.0     53.2
C24      6    1   Vic   f   1    93.1   54.8     90.5  35.5     73.2     53.6
C27      8    1   Vic   f   6    94.8   57.6     91.0  37.0     72.7     53.9
C28      9    1   Vic   f   9    93.4   56.3     91.5  37.0     72.4     52.9
C31     10    1   Vic   f   6    91.8   58.0     89.5  37.5     70.9     53.4
C32     11    1   Vic   f   9    93.3   57.2     89.5  39.0     77.2     51.3
C34     12    1   Vic   f   5    94.9   55.6     92.0  35.5     71.7     51.0
C45     17    1   Vic   f   1    94.7   67.7     89.5  36.5     73.2     53.2
C48     19    1   Vic   f   5    94.4   55.4     90.5  35.0     73.4     53.9
C50     20    1   Vic   f   4    94.8   56.3     89.0  38.0     73.8     52.4
C54     21    1   Vic   f   3    95.9   58.1     96.5  39.5     77.9     52.9
C58     23    1   Vic   f   4    92.5   56.1     89.0  36.0     72.8     53.3
C63     27    1   Vic   f   2    90.5   54.5     85.0  35.0     70.3     50.8
A1      29    1   Vic   f   3    92.8   56.0     88.0  35.0     74.9     51.8
A2      30    1   Vic   f   2    92.1   54.4     84.0  33.5     70.6     50.8
A4      32    1   Vic   f   4    94.3   56.7     94.0  39.0     74.8     52.0
BB17    37    2   Vic   f   2    89.3   54.8     82.5  35.0     71.2     52.0
BB31    39    2   Vic   f   1    84.7   51.5     75.0  34.0     68.7     53.4
BB33    40    2   Vic   f   3    91.0   55.0     84.5  36.0     72.8     51.4
BB36    41    2   Vic   f   5    88.4   57.0     83.0  36.5       NA     40.3
BB40    43    2   Vic   f   2    90.0   55.5     81.0  32.0     72.0     49.4
WW4     50    3 other   f   5    91.6   56.4     88.0  38.0     65.0     47.2
WW5     51    3 other   f   5    95.6   59.6     85.0  36.0     64.0     43.9
WW7     53    3 other   f   3    93.1   58.1     91.0  38.0     67.4     46.0
BR4     57    4 other   f   4    95.1   59.4     93.0  41.0     67.2     45.3
BR7     60    4 other   f   2    91.3   57.7     88.0  39.0     63.1     47.0
CD2     62    5 other   f   3    91.3   58.0     90.5  39.0     65.5     41.3
CD3     63    5 other   f   6    92.0   56.4     88.5  38.0     64.1     46.3
CD4     64    5 other   f   3    96.9   56.5     89.5  38.5     63.0     45.1
CD5     65    5 other   f   5    93.5   57.4     88.5  38.0     68.2     41.7
CD6     66    5 other   f   3    90.4   55.8     86.0  36.5     63.2     44.2
CD10    70    5 other   f   7    91.9   56.4     87.0  38.0     65.4     44.1
BSF1    74    6 other   f   4    88.7   52.0     83.0  38.0     61.5     45.9
BSF9    82    6 other   f   4    86.0   54.0     82.0  36.5     60.7     42.9
BSF10   83    6 other   f   3    90.0   53.8     81.5  36.0     62.0     43.3
BSF13   86    6 other   f   3    88.2   53.2     86.5  38.5     60.3     43.7
BTP3    88    7 other   f   2    89.6   58.0     87.5  38.0     66.7     43.5
BTP15   99    7 other   f   3    93.3   56.2     86.5  38.5     64.8     43.8
BTP19  102    7 other   f   6    92.4   55.0     89.0  38.0     63.5     45.4
BTP21  104    7 other   f   3    93.6   59.9     89.0  40.0     67.6     46.0
       eye chest belly
C5    16.0  28.5  33.0
C10   15.5  30.0  34.0
C15   15.2  28.0  34.0
C23   15.1  28.5  33.0
C24   14.2  30.0  32.0
C27   14.5  29.0  34.0
C28   15.5  28.0  33.0
C31   14.4  27.5  32.0
C32   14.9  31.0  34.0
C34   15.3  28.0  33.0
C45   14.7  29.0  31.0
C48   15.2  28.0  32.0
C50   15.5  27.0  36.0
C54   14.2  30.0  40.0
C58   15.4  28.0  35.0
C63   14.2  23.0  28.0
A1    14.0  24.0  32.0
A2    14.5  24.5  33.0
A4    14.9  28.0  34.0
BB17  13.6  28.0  31.5
BB31  13.0  25.0  25.0
BB33  13.6  27.0  30.0
BB36  15.9  27.0  30.5
BB40  13.4  29.0  31.0
WW4   14.9  28.0  36.0
WW5   17.4  28.0  38.5
WW7   16.5  26.0  33.5
BR4   14.5  31.0  39.0
BR7   14.4  26.0  30.0
CD2   16.0  27.0  32.0
CD3   15.2  25.5  28.5
CD4   17.1  25.5  33.0
CD5   14.0  29.0  38.5
CD6   15.7  26.5  34.0
CD10  13.0  27.0  34.0
BSF1  14.7  26.0  34.0
BSF9  15.4  26.0  32.0
BSF10 14.0  25.0  29.0
BSF13 13.6  26.0  31.0
BTP3  16.0  25.5  31.5
BTP15 14.0  28.0  35.0
BTP19 13.0  25.0  30.0
BTP21 14.8  28.5  33.5
##note the sex is text/categorical and so quotation marks are needed. 
##R doesn't care about whether they are double quotation marks (") or single ('). They work the same. 
# If we don't assign it to an object, then it just prints out for us to see! 
#oftentimes, we want to take our subset and give it a new name. This takes our subset and assigns it to a new dataset called `possum_f`. 

possum_f <- possum %>%
  filter(sex == "f")

#Notice if you assign it to an object that it doesn't print out the contents. 
# You'll see the new object in your environment on the top right ---> 
# If you click on the word `possum_f` (not the blue play button) it will open the object so you can see what is saved inside. 

It can also be used with numeric criteria.

Suppose we want a list of all the possums whose total length is less than 80cm.

possum %>% 
  filter(totlngth < 80)
     case site Pop sex age hdlngth skullw totlngth taill footlgth earconch  eye
BB31   39    2 Vic   f   1    84.7   51.5       75  34.0     68.7     53.4 13.0
BB38   42    2 Vic   m   3    85.3   54.1       77  32.0     62.7     51.2 13.8
BB41   44    2 Vic   m  NA    85.1   51.5       76  35.5     70.3     52.6 14.4
     chest belly
BB31  25.0    25
BB38  25.5    33
BB41  23.0    27

The logical operators are given below:

select

select is used to extract only certain columns. For example, perhaps we only want to print out a dataset with only the population, the sex, and the total length of each possum.

# %>% 
#MAC (CMD + Shift + M)
#PC (Ctrl + Shift + M)

possum %>% 
  select(Pop, sex, totlngth)
        Pop sex totlngth
C3      Vic   m     89.0
C5      Vic   f     91.5
C10     Vic   f     95.5
C15     Vic   f     92.0
C23     Vic   f     85.5
C24     Vic   f     90.5
C26     Vic   m     89.5
C27     Vic   f     91.0
C28     Vic   f     91.5
C31     Vic   f     89.5
C32     Vic   f     89.5
C34     Vic   f     92.0
C36     Vic   m     89.5
C37     Vic   m     91.5
C39     Vic   m     85.5
C40     Vic   m     86.0
C45     Vic   f     89.5
C47     Vic   m     90.0
C48     Vic   f     90.5
C50     Vic   f     89.0
C54     Vic   f     96.5
C55     Vic   m     91.0
C58     Vic   f     89.0
C59     Vic   m     84.0
C60     Vic   m     91.5
C61     Vic   m     90.0
C63     Vic   f     85.0
C64     Vic   m     87.0
A1      Vic   f     88.0
A2      Vic   f     84.0
A3      Vic   m     93.0
A4      Vic   f     94.0
AD1     Vic   m     89.0
BB4     Vic   m     85.5
BB13    Vic   m     85.0
BB15    Vic   m     88.0
BB17    Vic   f     82.5
BB25    Vic   m     80.5
BB31    Vic   f     75.0
BB33    Vic   f     84.5
BB36    Vic   f     83.0
BB38    Vic   m     77.0
BB40    Vic   f     81.0
BB41    Vic   m     76.0
BB44    Vic   m     81.0
BB45    Vic   m     84.0
WW1   other   m     89.0
WW2   other   m     85.0
WW3   other   m     85.0
WW4   other   f     88.0
WW5   other   f     85.0
WW6   other   m     93.5
WW7   other   f     91.0
BR1   other   m     91.5
BR2   other   m     92.5
BR3   other   m     93.7
BR4   other   f     93.0
BR5   other   m     91.0
BR6   other   m     96.0
BR7   other   f     88.0
CD1   other   m     86.0
CD2   other   f     90.5
CD3   other   f     88.5
CD4   other   f     89.5
CD5   other   f     88.5
CD6   other   f     86.0
CD7   other   m     85.0
CD8   other   m     88.5
CD9   other   m     88.0
CD10  other   f     87.0
CD11  other   m     90.0
CD12  other   m     80.5
CD13  other   m     82.0
BSF1  other   f     83.0
BSF2  other   m     89.0
BSF3  other   m     89.0
BSF4  other   m     84.0
BSF5  other   m     81.0
BSF6  other   m     81.0
BSF7  other   m     84.0
BSF8  other   m     85.5
BSF9  other   f     82.0
BSF10 other   f     81.5
BSF11 other   m     80.5
BSF12 other   m     92.0
BSF13 other   f     86.5
BTP1  other   m     93.0
BTP3  other   f     87.5
BTP4  other   m     84.5
BTP5  other   m     85.0
BTP6  other   m     89.0
BTP7  other   m     85.0
BTP8  other   m     82.0
BTP9  other   m     84.0
BTP10 other   m     88.5
BTP12 other   m     83.0
BTP13 other   m     86.0
BTP14 other   m     84.0
BTP15 other   f     86.5
BTP16 other   m     81.5
BTP17 other   m     82.5
BTP19 other   f     89.0
BTP20 other   m     82.5
BTP21 other   f     89.0

We can combine criteria together as well in one command with multiple pipes:

possum %>% 
  filter(sex == "f") %>% 
  filter(totlngth < 80) %>% 
  select(Pop, sex, age, totlngth)
     Pop sex age totlngth
BB31 Vic   f   1       75

arrange

The arrange function allows us to order the rows of the data frame by the values of a particular column.

For example, arrange all the female possums by the longest length to the shortest total length

possum_f %>% 
  arrange(-totlngth)
      case site   Pop sex age hdlngth skullw totlngth taill footlgth earconch
C54     21    1   Vic   f   3    95.9   58.1     96.5  39.5     77.9     52.9
C10      3    1   Vic   f   6    94.0   60.0     95.5  39.0     75.4     51.9
A4      32    1   Vic   f   4    94.3   56.7     94.0  39.0     74.8     52.0
BR4     57    4 other   f   4    95.1   59.4     93.0  41.0     67.2     45.3
C15      4    1   Vic   f   6    93.2   57.1     92.0  38.0     76.1     52.2
C34     12    1   Vic   f   5    94.9   55.6     92.0  35.5     71.7     51.0
C5       2    1   Vic   f   6    92.5   57.6     91.5  36.5     72.5     51.2
C28      9    1   Vic   f   9    93.4   56.3     91.5  37.0     72.4     52.9
C27      8    1   Vic   f   6    94.8   57.6     91.0  37.0     72.7     53.9
WW7     53    3 other   f   3    93.1   58.1     91.0  38.0     67.4     46.0
C24      6    1   Vic   f   1    93.1   54.8     90.5  35.5     73.2     53.6
C48     19    1   Vic   f   5    94.4   55.4     90.5  35.0     73.4     53.9
CD2     62    5 other   f   3    91.3   58.0     90.5  39.0     65.5     41.3
C31     10    1   Vic   f   6    91.8   58.0     89.5  37.5     70.9     53.4
C32     11    1   Vic   f   9    93.3   57.2     89.5  39.0     77.2     51.3
C45     17    1   Vic   f   1    94.7   67.7     89.5  36.5     73.2     53.2
CD4     64    5 other   f   3    96.9   56.5     89.5  38.5     63.0     45.1
C50     20    1   Vic   f   4    94.8   56.3     89.0  38.0     73.8     52.4
C58     23    1   Vic   f   4    92.5   56.1     89.0  36.0     72.8     53.3
BTP19  102    7 other   f   6    92.4   55.0     89.0  38.0     63.5     45.4
BTP21  104    7 other   f   3    93.6   59.9     89.0  40.0     67.6     46.0
CD3     63    5 other   f   6    92.0   56.4     88.5  38.0     64.1     46.3
CD5     65    5 other   f   5    93.5   57.4     88.5  38.0     68.2     41.7
A1      29    1   Vic   f   3    92.8   56.0     88.0  35.0     74.9     51.8
WW4     50    3 other   f   5    91.6   56.4     88.0  38.0     65.0     47.2
BR7     60    4 other   f   2    91.3   57.7     88.0  39.0     63.1     47.0
BTP3    88    7 other   f   2    89.6   58.0     87.5  38.0     66.7     43.5
CD10    70    5 other   f   7    91.9   56.4     87.0  38.0     65.4     44.1
BSF13   86    6 other   f   3    88.2   53.2     86.5  38.5     60.3     43.7
BTP15   99    7 other   f   3    93.3   56.2     86.5  38.5     64.8     43.8
CD6     66    5 other   f   3    90.4   55.8     86.0  36.5     63.2     44.2
C23      5    1   Vic   f   2    91.5   56.3     85.5  36.0     71.0     53.2
C63     27    1   Vic   f   2    90.5   54.5     85.0  35.0     70.3     50.8
WW5     51    3 other   f   5    95.6   59.6     85.0  36.0     64.0     43.9
BB33    40    2   Vic   f   3    91.0   55.0     84.5  36.0     72.8     51.4
A2      30    1   Vic   f   2    92.1   54.4     84.0  33.5     70.6     50.8
BB36    41    2   Vic   f   5    88.4   57.0     83.0  36.5       NA     40.3
BSF1    74    6 other   f   4    88.7   52.0     83.0  38.0     61.5     45.9
BB17    37    2   Vic   f   2    89.3   54.8     82.5  35.0     71.2     52.0
BSF9    82    6 other   f   4    86.0   54.0     82.0  36.5     60.7     42.9
BSF10   83    6 other   f   3    90.0   53.8     81.5  36.0     62.0     43.3
BB40    43    2   Vic   f   2    90.0   55.5     81.0  32.0     72.0     49.4
BB31    39    2   Vic   f   1    84.7   51.5     75.0  34.0     68.7     53.4
       eye chest belly
C54   14.2  30.0  40.0
C10   15.5  30.0  34.0
A4    14.9  28.0  34.0
BR4   14.5  31.0  39.0
C15   15.2  28.0  34.0
C34   15.3  28.0  33.0
C5    16.0  28.5  33.0
C28   15.5  28.0  33.0
C27   14.5  29.0  34.0
WW7   16.5  26.0  33.5
C24   14.2  30.0  32.0
C48   15.2  28.0  32.0
CD2   16.0  27.0  32.0
C31   14.4  27.5  32.0
C32   14.9  31.0  34.0
C45   14.7  29.0  31.0
CD4   17.1  25.5  33.0
C50   15.5  27.0  36.0
C58   15.4  28.0  35.0
BTP19 13.0  25.0  30.0
BTP21 14.8  28.5  33.5
CD3   15.2  25.5  28.5
CD5   14.0  29.0  38.5
A1    14.0  24.0  32.0
WW4   14.9  28.0  36.0
BR7   14.4  26.0  30.0
BTP3  16.0  25.5  31.5
CD10  13.0  27.0  34.0
BSF13 13.6  26.0  31.0
BTP15 14.0  28.0  35.0
CD6   15.7  26.5  34.0
C23   15.1  28.5  33.0
C63   14.2  23.0  28.0
WW5   17.4  28.0  38.5
BB33  13.6  27.0  30.0
A2    14.5  24.5  33.0
BB36  15.9  27.0  30.5
BSF1  14.7  26.0  34.0
BB17  13.6  28.0  31.5
BSF9  15.4  26.0  32.0
BSF10 14.0  25.0  29.0
BB40  13.4  29.0  31.0
BB31  13.0  25.0  25.0
#use arrange(variable) to go from least to most
#use arrange(desc(variable)) to go from most to least, arrange(-variable) also works

slice

The slice function will allow us to pick only a subset of the rows based on their numeric order (1st through last).

For example, if I want a list the 5 female possums with longest total lengths.

possum_f %>% 
  arrange(-totlngth) %>% 
  slice(1:5)
    case site   Pop sex age hdlngth skullw totlngth taill footlgth earconch
C54   21    1   Vic   f   3    95.9   58.1     96.5  39.5     77.9     52.9
C10    3    1   Vic   f   6    94.0   60.0     95.5  39.0     75.4     51.9
A4    32    1   Vic   f   4    94.3   56.7     94.0  39.0     74.8     52.0
BR4   57    4 other   f   4    95.1   59.4     93.0  41.0     67.2     45.3
C15    4    1   Vic   f   6    93.2   57.1     92.0  38.0     76.1     52.2
     eye chest belly
C54 14.2    30    40
C10 15.5    30    34
A4  14.9    28    34
BR4 14.5    31    39
C15 15.2    28    34