#LOAD PACKAGES
library(tidyverse)
library(infer)
library(broom)
set.seed(999)
Notes L
Exploratory Data Analysis & Simple Linear Regression
Let’s think about some ways we can summarize and explore relationships between two variables.
Today, we are going to dive into the “two numeric variables” case.
Motivation using data we have seen before
Sustainability Data
#LOAD DATA
power <- read.csv("../data/power.csv")
power %>%
  drop_na() %>%
  ggplot(aes(x=windSpeedRadarTower_m_per_sec, y=power_kW)) +
  geom_point(pch=16, alpha=0.5, position="jitter") +
  geom_smooth(method=lm, se=FALSE) +
  labs(x = "Wind Speed (m/s)", y = "Power (kW)") +
  theme_minimal()
Correlation
Correlation always takes on values between -1 and 1 and describes the strength and direction of the linear relationship between two variables.
kangaroo_rats <- kangaroo_rats %>%
  drop_na(hfl) %>%
  drop_na(wgt)
#cor(x, y)
cor(kangaroo_rats$hfl, kangaroo_rats$wgt)
[1] 0.4741858
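Because correlation only measures linear association, a strong curved relationship can still have a correlation near 0. The following is a minimal sketch with made-up vectors (x, y_line, and y_curve are illustrative names and are not part of the kangaroo rat data):

```r
# Made-up values for illustration only (not from the course data)
x <- -10:10

# A perfectly linear relationship
y_line <- 2 * x + 1

# A perfectly curved (quadratic) relationship, symmetric around x = 0
y_curve <- x^2

cor(x, y_line)   # perfectly linear with positive slope, so the correlation is 1
cor(x, y_curve)  # essentially 0, even though y_curve is fully determined by x
```

This is one reason to plot the data before relying on a correlation value.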
Correlation Does Not Imply Causation
Equations and Interpretation of Coefficients
We want to fit a linear model:
\[ \hat{y} = b_0 + b_1x \]
To fit a linear regression model, we need the lm() function.
#modelname <- lm(y ~ x, data = dataset)
power_model <- lm(power_kW ~ windSpeedRadarTower_m_per_sec, data=power)
# calling the name of the model spits out the coefficients (slope and y-intercept)
power_model
Call:
lm(formula = power_kW ~ windSpeedRadarTower_m_per_sec, data = power)
Coefficients:
(Intercept) windSpeedRadarTower_m_per_sec
-2.5469 0.8558
To interpret the y-intercept coefficient (\(b_0\)):
- If the wind speed is 0 m/s, this model predicts that the power generated will be about -2.5 kW on average.
To interpret the slope coefficient (\(b_1\)):
- For every additional 1 m/s of wind speed, this model predicts that the power generated will increase by 0.8558 kW on average.
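To see both interpretations at work, the fitted equation can be evaluated by hand. This is only a sketch: predict_power is a hypothetical helper, and the wind speeds of 10 and 11 m/s are assumed values chosen for illustration (they may or may not fall within the observed data).

```r
# Coefficients copied from the printed power_model output
b0 <- -2.5469  # intercept (kW)
b1 <- 0.8558   # slope (kW per m/s)

# Hypothetical helper: predicted power (kW) at a given wind speed (m/s)
predict_power <- function(wind_speed) {
  b0 + b1 * wind_speed
}

# 10 and 11 m/s are assumed values, chosen only for illustration
predict_power(10)                      # -2.5469 + 0.8558 * 10 = 6.0111
predict_power(11) - predict_power(10)  # equals the slope, 0.8558
```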
Variability of the Slope Estimates
- We are usually using a sample to estimate what is going on in the population.
- Statistics that summarize sample data (like the slope) vary from sample to sample.
Consider the Banner-tailed kangaroo rats. The full dataset is 2575 kangaroo rats, but this is only a subset of all kangaroo rats in the world.
Suppose we only had a sample of 500.
kangaroo_rats_sample1 <- kangaroo_rats %>%
  sample_n(500)
kangaroo_rats_sample1 %>%
  ggplot(aes(x=hfl, y=wgt)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE)
kangaroo_model1 <- lm(wgt ~ hfl, data=kangaroo_rats_sample1)
kangaroo_model1
Call:
lm(formula = wgt ~ hfl, data = kangaroo_rats_sample1)
Coefficients:
(Intercept) hfl
-154.283 5.464
If we repeat this, we will get a slightly different slope.
kangaroo_rats_sample2 <- kangaroo_rats %>%
  sample_n(500)
kangaroo_rats_sample2 %>%
  ggplot(aes(x=hfl, y=wgt)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE)
kangaroo_model2 <- lm(wgt ~ hfl, data=kangaroo_rats_sample2)
kangaroo_model2
Call:
lm(formula = wgt ~ hfl, data = kangaroo_rats_sample2)
Coefficients:
(Intercept) hfl
-161.862 5.651
Let’s use rep_sample_n() from the infer package to repeat this sampling process many times:
kangaroo_rats_many <- kangaroo_rats %>%
  rep_sample_n(size = 20, replace = FALSE, reps = 500)
ggplot() +
  # plot the slopes of all the replicates
  geom_smooth(
    data = kangaroo_rats_many, aes(x = hfl, y = wgt, group = replicate),
    method = "lm", se = FALSE, fullrange = TRUE,
    color = "grey", linewidth = 0.1
  ) +
  # plot the slope of all kangaroo rats in Portal, AZ
  geom_smooth(
    data = kangaroo_rats, aes(x = hfl, y = wgt), method = "lm", se = FALSE,
    fullrange = TRUE, color = "blue"
  ) +
  # title matches the code above: 500 replicates, each of size 20
  ggtitle("500 Repeated Samples of size n=20") +
  theme_minimal()
- You might notice in the above figure that the \(\hat{y}\) values given by the lines are much more consistent in the middle of the dataset than at the ends. The reason is that every least-squares line must pass through the point of means \((\bar{x}, \bar{y})\) of its own sample, so the data anchor the lines near the center of the data cloud while the slopes are free to tilt around it.
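One way to check the anchoring claim is to predict at the mean of x and compare the result with the mean of y. This sketch uses R's built-in cars dataset as a stand-in (an assumption, made so the snippet runs without the kangaroo rat data):

```r
# Built-in dataset used as a stand-in for illustration
fit <- lm(dist ~ speed, data = cars)

# Prediction at the average x value
pred_at_mean <- predict(fit, newdata = data.frame(speed = mean(cars$speed)))

# For a least-squares line with an intercept, these two values agree
unname(pred_at_mean)
mean(cars$dist)
```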
kangaroo_rats_many_lm <- kangaroo_rats_many %>%
  group_by(replicate) %>%
  # fit one model per replicate and keep only the slope row
  do(tidy(lm(wgt ~ hfl, data = .))) %>%
  filter(term == "hfl")
ggplot(kangaroo_rats_many_lm, aes(x = estimate)) +
  geom_histogram(binwidth=0.5) +
  labs(
    x = "Slope estimate",
    y = "Count",
    title = "Banner-tailed Kangaroo Rats",
    subtitle = "Many random samples of 20 rats"
  ) +
  theme_minimal()
Understanding how much the slope varies from sample to sample helps us better understand a range of likely values for the true slope!
(Take a stats class to learn more!)
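The idea that slope estimates vary from sample to sample, and vary less for larger samples, can be sketched with a small simulation. Everything here is an assumption for illustration: the "population" is simulated rather than the kangaroo rat data, and one_slope is a hypothetical helper. Base R is used so the snippet is self-contained.

```r
set.seed(999)

# Simulated "population" (hypothetical numbers, not the kangaroo rat data)
N <- 2575
pop <- data.frame(x = rnorm(N, mean = 40, sd = 2))
pop$y <- -150 + 5 * pop$x + rnorm(N, sd = 15)

# Hypothetical helper: draw one sample of size n (without replacement)
# and return the fitted slope
one_slope <- function(n) {
  rows <- sample(nrow(pop), size = n)
  unname(coef(lm(y ~ x, data = pop[rows, ]))["x"])
}

# Repeat the sampling 500 times for two sample sizes
slopes_n20  <- replicate(500, one_slope(20))
slopes_n500 <- replicate(500, one_slope(500))

# Spread of the slope estimates across samples
sd(slopes_n20)
sd(slopes_n500)  # smaller: larger samples give more stable slopes
```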
Assumptions, Residuals, and Residual Plots
There are some!
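A residual is the observed value minus the value the model predicts, and a residual plot shows the residuals against the fitted values. Here is a minimal sketch, using the built-in cars dataset as a stand-in and base R plotting rather than ggplot2 (both are assumptions made to keep the snippet self-contained):

```r
# Built-in dataset used as a stand-in for illustration
fit <- lm(dist ~ speed, data = cars)

# residual = observed y - predicted y
res <- cars$dist - fitted(fit)

# Same values as R's built-in extractor resid()
all.equal(unname(res), unname(resid(fit)))

# Residual plot: look for random scatter around 0 with no clear pattern
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

A curve or a funnel shape in this plot suggests that a straight-line model may not be a good fit.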
Interpolation vs. Extrapolation
Interpolation is estimating a value that falls within the range of the observed data.
Extrapolation is estimating a value that falls outside the range of the observed data, where we cannot check whether the linear pattern still holds.
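The distinction can be checked in code by comparing a new x value with the range of observed x values before predicting. This is a sketch only: it uses the built-in cars dataset as a stand-in, prediction_type is a hypothetical helper, and the new speeds are assumed values.

```r
# Built-in dataset used as a stand-in for illustration
fit <- lm(dist ~ speed, data = cars)
x_range <- range(cars$speed)  # smallest and largest observed x

# Hypothetical helper: label new x values by whether they lie
# inside the observed range of x
prediction_type <- function(x_new) {
  ifelse(x_new >= x_range[1] & x_new <= x_range[2],
         "interpolation", "extrapolation")
}

new_speeds <- c(15, 40)  # assumed values for illustration
prediction_type(new_speeds)
predict(fit, newdata = data.frame(speed = new_speeds))
```

The model returns a number in both cases. The extrapolated prediction relies on the linear trend continuing beyond the data, which the data cannot confirm.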