Today we will…
summarize())This dataset contains a random sample of 1,000 births from North Carolina in 2004 (sampled from a larger dataset).
| fage | mage | mature | weeks | premie | visits | marital | gained | weight | lowbirthweight | gender | habit | whitemom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | 15 | younger mom | 37 | full term | 11 | not married | 38 | 6.63 | not low | female | nonsmoker | white |
| NA | 28 | younger mom | 38 | full term | 10 | married | 70 | 8.19 | not low | male | nonsmoker | white |
| 55 | 46 | mature mom | 31 | premie | 8 | married | 25 | 4.56 | low | female | nonsmoker | not white |
| 36 | 37 | mature mom | 44 | full term | 10 | married | 25 | 8.94 | not low | male | nonsmoker | white |
| NA | 30 | younger mom | 39 | full term | 10 | not married | 38 | 11.75 | not low | male | nonsmoker | white |
| 26 | 23 | younger mom | 40 | full term | 14 | married | 28 | 8.38 | not low | male | nonsmoker | white |
| 40 | 34 | younger mom | 39 | full term | 15 | married | 20 | 6.38 | not low | male | nonsmoker | not white |
| 27 | 26 | younger mom | 39 | full term | 16 | married | 38 | 8.00 | not low | female | nonsmoker | white |
| 24 | 24 | younger mom | 38 | full term | 17 | not married | 13 | 6.56 | not low | female | nonsmoker | not white |
| 34 | 36 | mature mom | 37 | full term | 14 | married | 40 | 8.81 | not low | male | nonsmoker | white |
In statistical models, we generally have one variable that is the response and one or more variables that are explanatory.
The scatterplot has been called the most “generally useful invention in the history of statistical graphics.”

Characterizing Relationships
How would you characterize this relationship?

Note: Much of what we are doing at this stage involves making judgment calls!
We often assume the value of our response variable is some function of our explanatory variable, plus some random noise.
\[response = f(explanatory) + noise\]
If we assume the relationship between \(x\) and \(y\) takes the form of a linear function…
\[ response = intercept + slope \times explanatory + noise \]
We use the following notation for this model:
Population Regression Model
\(Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\)
where \(\varepsilon \sim N(0, \sigma)\) is the random noise.
Fitted Regression Model
\(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\)
where \(\hat{}\) indicates the value was estimated.
Regress baby birthweight (response variable) on the pregnant parent’s weight gain (explanatory variable).
When visualizing data, fit a regression line (\(y\) on \(x\)) to your scatterplot.
Call:
lm(formula = weight ~ gained, data = ncbirths)
Residuals:
Min 1Q Median 3Q Max
-6.4085 -0.6950 0.1643 0.9222 4.5158
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.63003 0.11120 59.620 < 2e-16 ***
gained 0.01614 0.00332 4.862 1.35e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.474 on 971 degrees of freedom
(27 observations deleted due to missingness)
Multiple R-squared: 0.02377, Adjusted R-squared: 0.02276
F-statistic: 23.64 on 1 and 971 DF, p-value: 1.353e-06
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 6.6300336 | 0.1112054 | 59.619718 | 0.0e+00 |
| gained | 0.0161405 | 0.0033195 | 4.862253 | 1.4e-06 |
The difference between observed (point) and expected (line).
| .rownames | weight | gained | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|---|
| 1 | 7.63 | 38 | 7.243372 | 0.3866279 | 0.0013265 | 1.474586 | 0.0000458 | 0.2624942 |
| 2 | 7.88 | 20 | 6.952843 | 0.9271567 | 0.0015686 | 1.474337 | 0.0003113 | 0.6295530 |
| 3 | 6.63 | 38 | 7.243372 | -0.6133721 | 0.0013265 | 1.474506 | 0.0001152 | -0.4164381 |
There are four conditions that must be met for a linear regression model to be appropriate:
Is the relationship linear?

Are the observations independent? Difficult to tell!
What does independence mean?
Should not be able to know the \(y\) value for one observation based on the \(y\) value for another observation.
Independence violations:
Are the residuals normally distributed?
Less important than linearity or independence:
Do the residuals have equal (constant) variance?
Sum of Square Errors (SSE)
Root Mean Square Error (RMSE)
Call:
lm(formula = weight ~ gained, data = ncbirths)
Residuals:
Min 1Q Median 3Q Max
-6.4085 -0.6950 0.1643 0.9222 4.5158
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.63003 0.11120 59.620 < 2e-16 ***
gained 0.01614 0.00332 4.862 1.35e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.474 on 971 degrees of freedom
(27 observations deleted due to missingness)
Multiple R-squared: 0.02377, Adjusted R-squared: 0.02276
F-statistic: 23.64 on 1 and 971 DF, p-value: 1.353e-06
R-squared
Regress baby birthweight on…
… gestation weeks.
Why does it make sense that the left model is better?
When fitting a linear regression, you can include…
…multiple explanatory variables.
lm(y ~ x1 + x2 + x3 + ... + xn)
…interaction terms.
lm(y ~ x1 + x2 + x1:x2)
lm(y ~ x1*x2)
…quadratic relationships.
lm(y ~ I(x1^2) + x1)
\[\hat{y}_i = 6.6 + .016x_i\]
where \(\hat{y}_i\) is the estimated birth weight in pounds and \(x_i\) is the weight gained during pregnancy by the birthing parent in pounds.
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | 6.6 | 6.4, 6.8 | <0.001 |
| gained | 0.02 | 0.01, 0.02 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
gt package (gtsummary)map2() FamilyThese functions allow us to iterate over two lists at the same time.

Each function has two list arguments, denoted .x and .y, and a function argument.
map2() ExampleFind the minimum.
nest() and unnest()map() family very nicely with two tidyr functions: nest() and unnest().nest()Nest subsets the data (as tibbles) inside a tibble.
# A tibble: 6 × 3
premie habit ph_dat
<fct> <fct> <list>
1 full term nonsmoker <tibble [739 × 11]>
2 premie nonsmoker <tibble [133 × 11]>
3 full term smoker <tibble [107 × 11]>
4 premie smoker <tibble [19 × 11]>
5 <NA> nonsmoker <tibble [1 × 11]>
6 <NA> <NA> <tibble [1 × 11]>
unnest()Un-nest the data by row binding the subsets back together.
# A tibble: 6 × 13
premie fage mage mature weeks visits marital gained weight lowbirthweight
<fct> <int> <int> <fct> <int> <int> <fct> <int> <dbl> <fct>
1 full term NA 13 young… 39 10 not ma… 38 7.63 not low
2 full term NA 14 young… 42 15 not ma… 20 7.88 not low
3 full term 19 15 young… 37 11 not ma… 38 6.63 not low
4 full term 21 15 young… 41 6 not ma… 34 8 not low
5 full term NA 15 young… 39 9 not ma… 27 6.38 not low
6 full term NA 15 young… 38 19 not ma… 22 5.38 low
# ℹ 3 more variables: gender <fct>, habit <fct>, whitemom <fct>
map2() Example - Regression# A tibble: 4 × 3
premie habit ph_dat
<fct> <fct> <list>
1 full term nonsmoker <tibble [724 × 11]>
2 premie nonsmoker <tibble [126 × 11]>
3 full term smoker <tibble [103 × 11]>
4 premie smoker <tibble [19 × 11]>
# A tibble: 4 × 4
premie habit premie_smoke_dat mod
<fct> <fct> <list> <list>
1 full term nonsmoker <tibble [724 × 11]> <lm>
2 premie nonsmoker <tibble [126 × 11]> <lm>
3 full term smoker <tibble [103 × 11]> <lm>
4 premie smoker <tibble [19 × 11]> <lm>
# A tibble: 4 × 5
premie habit premie_smoke_dat mod pred_weight
<fct> <fct> <list> <list> <list>
1 full term nonsmoker <tibble [724 × 11]> <lm> <dbl [724]>
2 premie nonsmoker <tibble [126 × 11]> <lm> <dbl [126]>
3 full term smoker <tibble [103 × 11]> <lm> <dbl [103]>
4 premie smoker <tibble [19 × 11]> <lm> <dbl [19]>
ncbirths_clean |>
nest(premie_smoke_dat = -c(premie, habit)) |>
mutate(mod = map(premie_smoke_dat,
~ lm(weight ~ gained, data = .x)),
pred_weight = map2(.x = mod,
.y = premie_smoke_dat,
.f = ~ predict(object = .x, data = .y))) |>
select(-mod) |>
unnest(cols = c(premie_smoke_dat, pred_weight)) |>
select(premie, habit, weight, gained, pred_weight) |>
head()# A tibble: 6 × 5
premie habit weight gained pred_weight
<fct> <fct> <dbl> <int> <dbl>
1 full term nonsmoker 7.63 38 7.55
2 full term nonsmoker 7.88 20 7.43
3 full term nonsmoker 6.63 38 7.55
4 full term nonsmoker 8 34 7.53
5 full term nonsmoker 6.38 27 7.48
6 full term nonsmoker 5.38 22 7.44
# A tibble: 4 × 5
premie habit premie_smoke_dat mod coefs
<fct> <fct> <list> <list> <list>
1 full term nonsmoker <tibble [724 × 11]> <lm> <tibble [2 × 5]>
2 premie nonsmoker <tibble [126 × 11]> <lm> <tibble [2 × 5]>
3 full term smoker <tibble [103 × 11]> <lm> <tibble [2 × 5]>
4 premie smoker <tibble [19 × 11]> <lm> <tibble [2 × 5]>
# A tibble: 8 × 7
premie habit term estimate std.error statistic p.value
<fct> <fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 full term nonsmoker (Intercept) 7.29 0.0965 75.5 0
2 full term nonsmoker gained 0.00705 0.00285 2.48 1.34e- 2
3 premie nonsmoker (Intercept) 4.55 0.393 11.6 1.75e-21
4 premie nonsmoker gained 0.0258 0.0137 1.89 6.09e- 2
5 full term smoker (Intercept) 6.77 0.223 30.3 1.71e-52
6 full term smoker gained 0.0121 0.00613 1.98 5.08e- 2
7 premie smoker (Intercept) 5.75 0.880 6.54 5.04e- 6
8 premie smoker gained -0.0320 0.0293 -1.09 2.89e- 1
ncbirths_clean |>
nest(premie_smoke_dat = -c(premie, habit)) |>
mutate(mod = map(premie_smoke_dat,
~ lm(weight ~ gained, data = .x)),
coefs = map(mod,
~ broom::tidy(.x))) |>
select(premie, habit, coefs) |>
unnest(cols = coefs) |>
mutate(term = fct_recode(.f = term,
"Intercept" = "(Intercept)",
"Mother Weight Gain (lb.)" = "gained"
)) |>
select(-std.error, -statistic) |>
mutate(p.value = case_when(p.value < .0001 ~ "<.001",
TRUE ~ as.character(round(p.value, 3)))) |>
arrange(premie, habit, term) |>
gt() |>
fmt_number(estimate,
decimals = 3) |>
tab_row_group(
label = md("**Premature + Smoker**"),
rows = premie == "premie" & habit == "smoker") |>
tab_row_group(
label = md("**Premature + Non-Smoker**"),
rows = premie == "premie" & habit == "nonsmoker") |>
tab_row_group(
label = md("**Full Term + Smoker**"),
rows = premie == "full term" & habit == "smoker") |>
tab_row_group(
label = md("**Full Term + Non-Smoker**"),
rows = premie == "full term" & habit == "nonsmoker") |>
cols_hide(c(premie, habit)) |>
cols_align(align = "left",
columns = term) |>
tab_style(
style = cell_fill(color = "gray85"),
locations = cells_row_groups()) |>
cols_label(
"term" = md("**Model & Term**"),
"estimate" = md("**Est. Coef.**"),
"p.value" = md("**p-value**")
) | Model & Term | Est. Coef. | p-value |
|---|---|---|
| Full Term + Non-Smoker | ||
| Intercept | 7.285 | <.001 |
| Mother Weight Gain (lb.) | 0.007 | 0.013 |
| Full Term + Smoker | ||
| Intercept | 6.767 | <.001 |
| Mother Weight Gain (lb.) | 0.012 | 0.051 |
| Premature + Non-Smoker | ||
| Intercept | 4.549 | <.001 |
| Mother Weight Gain (lb.) | 0.026 | 0.061 |
| Premature + Smoker | ||
| Intercept | 5.754 | <.001 |
| Mother Weight Gain (lb.) | −0.032 | 0.289 |
