9 Linear Regression

Reading: 4 minute(s) at 200 WPM

Videos: 31 minutes

Objectives

Fit a simple linear regression model in R
Pull out relevant model information using the broom package

This chapter is heavily from Dr. Theobold’s course-page material.

9.1 Simple Linear Regression

You now have the skills to import, wrangle, and visualize data. All of these tools help us prepare our data for statistical modeling. While we have sprinkled some formal statistical analyses throughout the course, in this section we will be formally reviewing Linear Regression. First let’s review simple linear regression. Linear regression models the linear relationship between two quantitative variables.

Review of Simple Linear Regression and Conditions

Handy function shown in the reading! skim from the skimr package.

9.2 Linear Regression in R

To demonstrate linear regression in R, we will be working with the penguins data set.

library(palmerpenguins)
data(penguins)
head(penguins) |> 
  kable()

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Torgersen	39.1	18.7	181	3750	male	2007
Adelie	Torgersen	39.5	17.4	186	3800	female	2007
Adelie	Torgersen	40.3	18.0	195	3250	female	2007
Adelie	Torgersen	NA	NA	NA	NA	NA	2007
Adelie	Torgersen	36.7	19.3	193	3450	female	2007
Adelie	Torgersen	39.3	20.6	190	3650	male	2007

When conducting linear regression with tools in R, we often want to visualize the relationship between the two quantitative variables of interest with a scatterplot. We can then use either geom_smooth(method = "lm") (or equivalently stat_smooth(method = "lm") to add a line of best fit (“regression line”) based on the ordinary least squares (OLS) equation to our scatter plot. The regression line is shown in a default blue line with the standard error uncertainty displayed in a gray transparent band (use se = FALSE to hide the standard error uncertainty band). These visual aesthetics can be changed just as any other plot aesthetics.

penguins |>
  ggplot(aes(x = bill_depth_mm, 
             y = bill_length_mm
             )
         ) +
  geom_point() +
  geom_smooth(method = "lm") + 
  labs(x = "Bill Depth (mm)",
       subtitle = "Bill Length (mm)",
       title = "Relationship between penguin bill length and depth"
       ) +
  theme(axis.title.y = element_blank())

Be careful of “overplotting” and use geom_jitter() instead of geom_point() if your data set is dense. This is strictly a data visualization tool and will not alter the original values.

In simple linear regression, we can define the linear relationship with a mathematical equation given by:

\[y = a + b\cdot x\]

Remember \(y = m\cdot x+b\) from eighth grade?!

where

\(y\) are the values of the response variable,
\(x\) are the values of the explanatory/predictor variable,
\(a\) is the \(y\)-intercept (average value of \(y\) when \(x = 0\)), and
\(b\) is the slope coefficient (for every 1 unit increase in \(x\), the average of \(y\) increases by b)

Remember “rise over run”!

In statistics, we use slightly different notation to denote this relationship with the estimated linear regression equation:

\[\hat y = b_0 + b_1\cdot x.\]

Note that the “hat” symbol above our response variable indicates this is an “estimated” value (or our best guess).

We can “fit” the linear regression equation with the lm function in R. The formula argument is denoted as y ~ x where the left hand side (LHS) is our response variable and the right hand side (RHS) contains our explanatory/predictor variable(s). We indicate the data set with the data argument and therefore use the variable names (as opposed to vectors) when defining our formula. We name (my_model) and save our fitted model just as we would any other R object.

my_model <- lm(bill_length_mm ~ bill_depth_mm, 
               data = penguins
               )

Now that we have fit our linear regression, we might be wondering how we actually get the information out of our model. What are the y-intercept and slope coefficient estimates? What is my residual? How good was the fit? The code options below help us obtain this information.

This is what is output when you just call the name of the linear model object you created (my_model). Notice, the output doesn’t give you much information and it looks kind of bad.

my_model


Call:
lm(formula = bill_length_mm ~ bill_depth_mm, data = penguins)

Coefficients:
  (Intercept)  bill_depth_mm  
      55.0674        -0.6498

This is what is output when you use the summary() function on a linear model object. Notice, the output gives you a lot of information, some of which is really not that useful. And, the output is quite messy!

summary(my_model)


Call:
lm(formula = bill_length_mm ~ bill_depth_mm, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.8949  -3.9042  -0.3772   3.6800  15.5798 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    55.0674     2.5160  21.887  < 2e-16 ***
bill_depth_mm  -0.6498     0.1457  -4.459 1.12e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.314 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.05525,   Adjusted R-squared:  0.05247 
F-statistic: 19.88 on 1 and 340 DF,  p-value: 1.12e-05

The tidy() function from the {broom} package takes a linear model object and puts the “important” information into a tidy tibble output.

Ah! Just right!

library(broom)
tidy(my_model) |> 
  kable()

term	estimate	std.error	statistic	p.value
(Intercept)	55.0673698	2.5159514	21.887295	0.00e+00
bill_depth_mm	-0.6498356	0.1457327	-4.459093	1.12e-05

If you are sad that you no longer have the statistics about the model fit (e.g., R-squared, adjusted R-squared, \(\sigma\)), you can use the glance() function from the broom package to grab those!

broom::glance(my_model) |> 
  kable()

r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.0552499	0.0524712	5.314418	19.88351	1.12e-05	1	-1055.559	2117.118	2128.622	9602.635	340	342

Objectives

9.1 Simple Linear Regression

Review of Simple Linear Regression and Conditions

9.2 Linear Regression in R

References