Notes H

forcats: for working with categorical variables

Author

EMW

Published

June 7, 2024

The R package forcats is designed to make working with categorical variables easier and more efficient. It provides a set of functions that allow you to manipulate and analyze categorical data with ease. In this lesson, we’ll cover the basics of the forcats package and some of its most useful functions.

Categorical Variables

Let’s review what categorical data is. Categorical data is a type of data that consists of categories or labels.

Examples of categorical data include:

  • Colors (red, blue, green, etc.)
  • Types of vehicles (sedan, SUV, truck)
  • Educational degrees (high school, college, graduate school)

Categorical data can be further divided into two types: nominal and ordinal. Nominal data consists of categories that have no inherent order, while ordinal data consists of categories that have a natural order. For example, educational degrees are ordinal data because they can be ordered from least to most advanced.

# load packages
library(tidyverse)
library(forcats)

Recall: portal Rodent Data

# load data
portal_rodent <- read.csv("https://github.com/weecology/PortalData/raw/main/Rodents/Portal_rodent.csv")
portal_species <- read.csv("https://raw.githubusercontent.com/weecology/PortalData/main/Rodents/Portal_rodent_species.csv")

Check out Kangaroo Rats!

There are 4 different kids of kangaroo rats in this dataset:

  • Merriam’s kangaroo rat (DM)
  • Ord’s kangaroo rat (DO)
  • Kangaroo rat (DX)
  • Banner-tailed kangaroo rat (DS)

Create a dataset called kangaroo_rat which has all the data on any kangaroo rats caught in Portal, AZ. You should display the time the rat was captured (year, month, day), the species scientific and common name, and the sex and hindfoot length.

kangaroo_rat <- portal_rodent %>% 
  filter(species %in% c("DM", "DO", "DX", "DS")) %>% 
  left_join(portal_species, by = c("species" = "speciescode")) %>% 
  select(year, month, day, species, scientificname, commonname, sex, hfl)

Create a line graph showing the number of each species caught in each year. Don’t worry about how pretty the labeling is.

kangaroo_rat %>% 
  group_by(year, commonname) %>% 
  summarize(count=n()) %>% 
  ggplot(aes(x=year, y=count, color=commonname)) + 
  geom_line() + 
  #OPTIONAL
  theme_minimal()

Create a side-by-side boxplot showing the distribution of hindfoot length between the 4 different species of Kangaroo Rat. Don’t worry about how pretty the labeling is.

kangaroo_rat %>% 
  ggplot(aes(x=commonname, y=hfl)) + 
  geom_boxplot() + 
  #OPTIONAL + 
  theme_minimal()

Notice that:

  • species, commonname, and sex are all categorical variables or factors
  • we notice the factors always display alphabetically.

Reordering Factor Levels

One of the most useful functions is fct_relevel(), which allows you to reorder the levels of a factor. This can be useful when you want to change the default ordering of the levels or when you want to group certain levels together.

Is commonname a factor? First we need the variables in a very specific factor format (not chr)

kangaroo_rat %>% 
    select(commonname) %>% 
    is.factor()
[1] FALSE

Let’s make it a factor!

kangaroo_rat <- kangaroo_rat %>% 
  #mutate(species = as.factor(species)) %>% 
  mutate(commonname = as.factor(commonname)) #%>% 
  #mutate(sex = as.factor(sex))

Let’s check the levels and their current ordering!

kangaroo_rat %>% 
  select(commonname) %>% 
  levels()
NULL
levels(kangaroo_rat$commonname)
[1] "Banner-tailed kangaroo rat" "Kangaroo rat"              
[3] "Merriam's kangaroo rat"     "Ord's kangaroo rat"        

To reorder the levels:

kangaroo_rat  <- kangaroo_rat  %>% 
  mutate(commonname = fct_relevel(commonname, "Merriam's kangaroo rat", "Ord's kangaroo rat" , "Banner-tailed kangaroo rat", "Kangaroo rat"))

levels(kangaroo_rat$commonname)
[1] "Merriam's kangaroo rat"     "Ord's kangaroo rat"        
[3] "Banner-tailed kangaroo rat" "Kangaroo rat"              

Let’s recreate our boxplot now:

kangaroo_rat %>% 
  ggplot(aes(x=commonname, y=hfl)) + 
  geom_boxplot() + 
  #OPTIONAL + 
  theme_minimal()

Rather than reordering them manually by typing the order, you could also re-level by some numeric criteria. For example:

kangaroo_rat <- kangaroo_rat %>% 
  mutate(commonname = fct_reorder(commonname, hfl, median))

kangaroo_rat %>% 
  ggplot(aes(x=commonname, y=hfl)) + 
  geom_boxplot() + 
  #OPTIONAL + 
  theme_minimal()

Renaming Factor levels

Sometimes you might not like the way the levels are named. For example, maybe you feel it’s bit redundant to have “kangaroo rat” written all the time.

# "newname" = "oldname"

kangaroo_rat <- kangaroo_rat %>% 
  mutate(commonname = fct_recode(commonname, "Merriam's"= "Merriam's kangaroo rat", "Ord's" = "Ord's kangaroo rat" , "Banner-tailed" = "Banner-tailed kangaroo rat", "Not specified" = "Kangaroo rat" ))

levels(kangaroo_rat$commonname)
[1] "Not specified" "Merriam's"     "Ord's"         "Banner-tailed"