Model Validation + Geospatial Graphics

Monday, June 1

Today we will…

Group Quiz 9
New Material:
- Model Validation
- Graphics with Geospatial Data (Maps)
Work Time
- PA 10: Map it!
- Final Project Submission

Final Project Submission

Model Validation

How do we tell if a model is “good”?

There are lots of different metrics to measure model performance!
- Your choice depends on the type of model and what is “good” for the context
- Do you only care about predictions? Or more about inference?

Regression
- See week 9 slides
- RMSE
- \(R^2\)

Classification (binary outcomes)
- overall prediction success or failure rate
- sensitivity: true positive rate
- specificity: true negative rate
- many more…
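
A minimal sketch of how these metrics can be computed by hand in base R. The numbers below are made up purely for illustration, and the object names (`obs`, `pred`, `truth`, `predicted`) are hypothetical, not from the course data.

```r
# Toy regression example (made-up numbers, for illustration only)
obs  <- c(3.1, 4.0, 5.2, 6.1, 7.3)
pred <- c(3.0, 4.4, 5.0, 6.5, 7.0)

# RMSE: typical size of a prediction error, in the units of the response
rmse <- sqrt(mean((obs - pred)^2))

# R^2: 1 - (sum of squared errors) / (total sum of squares)
r2 <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

# Toy binary classification example (1 = positive, 0 = negative)
truth     <- c(1, 1, 1, 0, 0, 0, 0, 1)
predicted <- c(1, 0, 1, 0, 0, 1, 0, 1)

# overall prediction success rate
accuracy <- mean(predicted == truth)

# sensitivity: share of true positives that were predicted positive
sensitivity <- sum(predicted == 1 & truth == 1) / sum(truth == 1)

# specificity: share of true negatives that were predicted negative
specificity <- sum(predicted == 0 & truth == 0) / sum(truth == 0)
```

Note the direction of "good": a lower RMSE is better, while higher \(R^2\), accuracy, sensitivity, and specificity are better.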

Overfitting and Underfitting

From Introduction to Statistical Modeling

Big Question for Model Validation

Even if a model is “good” according to our metric with the data we have, how do we know if it will still work well with other data?

Train / Test Split

Big idea: Reserve part of your data for testing to get an idea of how the model would do on outside data

Split your data into a training and a testing set (typically 80%/20%)
Set the testing set aside
Do all the model development you want with the training data
Fit a final model with the training data
Generate predictions from that model with the testing data and calculate your performance metric
Compare the testing and training metrics
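
The steps above can be sketched in base R as follows. This is only an illustration: it uses the built-in `mtcars` data and the model `mpg ~ wt` as stand-ins, with an arbitrary seed and RMSE as the performance metric.

```r
set.seed(123)  # arbitrary seed so the split is reproducible

dat <- mtcars  # stand-in dataset (assumption, not the course data)
n <- nrow(dat)

# randomly choose 80% of the rows for training; the rest are set aside for testing
train_rows <- sample(seq_len(n), size = floor(0.8 * n))
train_dat <- dat[train_rows, ]
test_dat  <- dat[-train_rows, ]

# fit the final model with the training data only
fit <- lm(mpg ~ wt, data = train_dat)

# helper for the performance metric (RMSE here)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# metric on the data the model saw vs. the data it did not see
train_rmse <- rmse(train_dat$mpg, predict(fit, newdata = train_dat))
test_rmse  <- rmse(test_dat$mpg,  predict(fit, newdata = test_dat))

c(train = train_rmse, test = test_rmse)
```

Because RMSE is an error measure, worse test performance shows up as a larger test RMSE than training RMSE.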

Comparing Train / Test Performance

If the model is overfit…
- test performance < training performance
If the model is underfit …
- training performance is poor, and test performance is similar or even better
If the model is neither over nor underfit …
- test performance \(\approx\) training performance

\(k\)-fold Cross Validation

Iterates over \(k\) train / test splits
Uses all of the data
Especially useful for model development
- covariate selection
- specifying parameters for ML models

\(k\)-fold Cross Validation Process

Choose a value for \(k\)
Split data into \(k\) folds
For fold \(i\) from 1 … \(k\):
1. Fit model on observations not in fold \(i\)
2. Generate predictions from this model for the observations in fold \(i\)
3. Calculate and save performance metric of interest
Average the \(k\) performance metrics across folds

\(k\)-fold Cross Validation Process

Visualization of 5-fold CV by Joshua Ebner

\(k\)-fold Cross Validation Process - details

Choose a value for \(k\)

Typical values are 5 or 10

k <- 5

\(k\)-fold Cross Validation Process - details

Split data into \(k\) folds

Should be approximately the same size
You could just split the data into \(k\) groups based on their row number

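library(dplyr)
library(openintro) # assumed source of the ncbirths data

n <- nrow(ncbirths)

# consecutive blocks of rows become folds 1 to k
ncbirths <- ncbirths |> 
  mutate(fold_cut = cut(seq_len(n), breaks = k, labels = FALSE))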

More typical to assign groups randomly

set.seed(394)

ncbirths <- ncbirths |> 
  mutate(fold_random = sample(rep_len(1:k, length.out = n),
                       size = n))
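
A quick way to check that either approach gives folds of approximately the same size is to tabulate the fold labels. The sketch below uses a hypothetical `n` (chosen so that `k` does not divide it evenly) rather than the actual data.

```r
k <- 5
n <- 1003  # hypothetical number of rows, chosen so k does not divide n evenly

set.seed(394)

# folds from consecutive row numbers
fold_cut <- cut(seq_len(n), breaks = k, labels = FALSE)

# folds assigned at random: repeat 1..k until length n, then shuffle
fold_random <- sample(rep_len(seq_len(k), length.out = n))

# fold sizes should differ by very little
table(fold_cut)
table(fold_random)
```

With random assignment built from `rep_len()`, the fold sizes can differ by at most one observation.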

\(k\)-fold Cross Validation Process - details

We implemented 5-fold CV for the model of birth weight on gestation weeks with the NC births data…

cv_r2

[1] 0.3412554 0.4472941 0.5447437 0.5104462 0.2991063

Average the \(k\) performance metrics across folds

mean(cv_r2)

[1] 0.4285691

The \(R^2\) from fitting the model on the full dataset was 0.449, so it appears the model is neither overfitting nor underfitting (maybe slightly overfitting).

Actually… should combine

A first (not ideal) attempt at k-fold CV

# write a function to calculate the R2
r2_calc <- function(obs, pred){
  
  # sum of squared errors and total sum of squares (ignoring missing values)
  sse <- sum((obs - pred)^2, na.rm = TRUE)
  sst <- sum((obs - mean(obs, na.rm = TRUE))^2, na.rm = TRUE)
  
  1 - sse/sst

}

cv_r2 <- rep(NA, k)

# for each fold 1-k...
for(x in 1:k){
  
  # separate fold (test) data
  fold_dat <- ncbirths |> 
    filter(fold_random == x)
  
  # and training data
  train_dat <- ncbirths |> 
    filter(fold_random != x)
  
  # fit model with training data
  it_lm <- lm(weight ~ weeks, 
               data = train_dat)
  
  # generate predictions for the held-out fold data
  fold_preds <- predict(it_lm, newdata = fold_dat)

  # calculate R2 for the held-out fold data and save it
  cv_r2[x] <- r2_calc(obs = fold_dat$weight,
                      pred = fold_preds)
  
}

Not great!! This only works for the exact data and model that I was working with AND uses for-loops 🫣

Let’s write a general CV function to make this efficient!

Inputs:
- a dataframe
- formula for the desired model
- number of folds (\(k\))
Output:
- vector of \(k\) \(R^2\) values
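
One possible sketch of such a function, in base R. The name `cv_r2_sketch` is hypothetical, and this version assumes a linear model fit with `lm()` and a response that appears untransformed on the left side of the formula (so `log(y) ~ x` would not work as written). `vapply()` stands in for the for-loop; `map_dbl()` from purrr would do the same job.

```r
# Sketch of a general k-fold CV function (one of many possible designs)
cv_r2_sketch <- function(data, formula, k = 5) {

  n <- nrow(data)

  # randomly assign each row to one of k folds of (nearly) equal size
  folds <- sample(rep_len(seq_len(k), length.out = n))

  # name of the response variable: first variable in the formula
  response <- all.vars(formula)[1]

  # for each fold: fit on the other folds, predict the held-out fold, return R2
  vapply(seq_len(k), function(i) {
    train_dat <- data[folds != i, , drop = FALSE]
    fold_dat  <- data[folds == i, , drop = FALSE]

    fit   <- lm(formula, data = train_dat)
    preds <- predict(fit, newdata = fold_dat)
    obs   <- fold_dat[[response]]

    sse <- sum((obs - preds)^2, na.rm = TRUE)
    sst <- sum((obs - mean(obs, na.rm = TRUE))^2, na.rm = TRUE)
    1 - sse / sst
  }, numeric(1))
}

# example call on a built-in dataset (stand-in for the course data)
set.seed(394)
cv_r2_sketch(mtcars, mpg ~ wt, k = 5)
```

Setting the seed outside the function keeps the fold assignment reproducible without hard-coding it into the function itself.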

Important

You may need this function on the final exam…

Graphics with Geospatial Data (Maps)

We see geospatial graphics all the time

https://www.npr.org/sections/health-shots/2020/07/01/885263658/green-yellow-orange-or-red-this-new-tool-shows-covid-19-risk-in-your-county

Plotting geospatial data can uncover patterns that would be hard to determine through other analyses …

https://hpcf-files.umbc.edu/research/papers/REU2015Team2.pdf

… It can also help make grouping of observations in your analysis clear!

https://pmc.ncbi.nlm.nih.gov/articles/PMC11180987/

ArcGIS doesn’t get to have all the fun

There are now many tools in R to plot geospatial data:
maps / mapdata + geom_polygon()
- pros: simplest way to map the US counties / states and world countries
- cons: doesn’t include all geospatial boundaries you might want!
sf
- pros: works with any common spatial object (like those used in ArcGIS), plus it is well maintained and up to date!
- cons: a bit more of a learning curve to use
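
A minimal sketch of both approaches, assuming the ggplot2, maps, and sf packages are installed. The state outlines come from `map_data()`, and the sf example uses the North Carolina counties shapefile that ships with sf. The styling choices and the object names `p_states` and `p_nc` are arbitrary.

```r
library(ggplot2)  # map_data() also needs the maps package to be installed
library(sf)

# Approach 1: maps + geom_polygon()
# map_data() returns a data frame of boundary points (long, lat, group, ...)
states <- map_data("state")

p_states <- ggplot(states, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "grey40") +
  coord_quickmap() +   # keeps the aspect ratio reasonable for a map
  theme_void()

# Approach 2: sf + geom_sf()
# read the NC counties shapefile bundled with the sf package
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# fill each county by BIR74, a column included in that shapefile
p_nc <- ggplot(nc) +
  geom_sf(aes(fill = BIR74)) +
  theme_void()

# print p_states or p_nc to draw the maps
```

With `geom_polygon()` the `group` aesthetic matters: without it, ggplot connects the boundary points of different regions into one tangled shape. `geom_sf()` handles the geometry for you.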

PA 10: Map it!

The best way to learn how to make nice maps with ggplot is to just jump in yourself!

To do…

PA 10: Map it!
- Due Tuesday, 6/2 at 11:59pm.
Lab 9
- Due Tuesday, 6/2 at 11:59pm.
Final Project Report
- Due Friday, 6/5 at 11:59pm.
Course Evaluation
- Closes Friday, 6/5 at 11:59pm.