Model Validation + Geospatial Graphics

Tuesday, June 3

Today we will…

  • Linear Regression: Feedback
  • New Material:
    • Model Validation
    • Graphics with Geospatial Data (Maps)
  • Work Time
    • PA 10: Map it!
    • PC5: Final Report

Linear Regression: Project Feedback

  • They are coming along very nicely!

  • Final submission should be a polished report.

    • Integrate writing and analysis.
  • Think about the readability of the numbers you are presenting.

    • Do you need 6 decimal places?
    • Is scientific notation easily understood?
  • Include units on your plots, including on any transformed variables.

  • Don’t display the raw R lm() output; present a cleaned-up table instead (one option is sketched below).
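For example, a minimal sketch using the broom package (an assumed choice; any cleaned-up table works), with the birth weight model from later in these slides as the example:

library(broom)

fit <- lm(weight ~ weeks, data = ncbirths)
tidy(fit)    # coefficient table as a data frame you can round and format
glance(fit)  # one-row model summary (R^2, sigma, etc.)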

Model Validation

How do we tell if a model is “good”?

  • There are lots of different metrics to measure model performance!
    • Your choice depends on the type of model and what is “good” for the context
    • Do you only care about predictions? Or more about inference?
  • See week 9 slides
  • RMSE
  • \(R^2\)
  • overall prediction success or failure rate
  • sensitivity: true positive rate
  • specificity: true negative rate
  • many more… (the first two are sketched in code below)
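For reference, a minimal sketch of the first two in R; actual and predicted are illustrative names for a vector of observed responses and the corresponding model predictions:

# RMSE: typical size of a prediction error, in the units of the response
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

# R^2 as the squared correlation between observed and predicted values
r_squared <- function(actual, predicted) {
  cor(actual, predicted, use = "complete.obs")^2
}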

Overfitting and Underfitting

Big Question for Model Validation

Even if a model is “good” according to our metric with the data we have, how do we know if it will still work well with other data?

Train / Test Split

  • Big idea: Reserve part of your data for testing to get an idea of how the model would do on outside data (a short R sketch follows these steps)
    1. Split your data into a training and a testing set (typically 80%/20%)
    2. Set the testing set aside
    3. Do all the model development you want with the training data
    4. Fit a final model with the training data
    5. Generate predictions from that model with the testing data and calculate your performance metric
    6. Compare the testing and training metrics
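A minimal sketch of these steps in R, assuming the openintro ncbirths data and the birth weight on gestation weeks model used later in these slides:

library(dplyr)
library(openintro)  # assumed source of the ncbirths data

set.seed(42)

# steps 1-2: 80% / 20% split
n <- nrow(ncbirths)
train_rows <- sample(1:n, size = floor(0.8 * n))
train <- ncbirths[train_rows, ]
test  <- ncbirths[-train_rows, ]

# steps 3-4: develop and fit the final model with the training data only
fit <- lm(weight ~ weeks, data = train)

# step 5: generate predictions for the testing data and compute the metric
test_r2 <- cor(test$weight, predict(fit, newdata = test),
               use = "complete.obs")^2

# step 6: compare testing and training metrics
c(train = summary(fit)$r.squared, test = test_r2)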

Comparing Train / Test Performance

  • If the model is overfit…

    • test performance < training performance
  • If the model is underfit …

    • test performance > training performance
  • If the model is neither over nor underfit …

    • test performance \(\approx\) training performance

\(k\)-fold Cross Validation

  • Iterates over \(k\) train / test splits
  • Uses all of the data
  • Especially useful for model development
    • covariate selection
    • specifying parameters for ML models

\(k\)-fold Cross Validation Process

  1. Choose a value for \(k\)

  2. Split data into \(k\) folds

  3. For fold \(i\) from 1 … \(k\):

    1. Fit model on observations not in fold \(i\)
    2. Generate predictions from this model for the observations in fold \(i\)
    3. Calculate and save performance metric of interest
  4. Average the \(k\) performance metrics across folds

\(k\)-fold Cross Validation Process - details

  1. Choose a value for \(k\)

  • Typical values are 5 or 10
k <- 5

\(k\)-fold Cross Validation Process - details

  2. Split data into \(k\) folds
  • Should be approximately the same size
  • You could just split the data into \(k\) groups based on their row number
n <- nrow(ncbirths) 

# assign folds by row position: the first n/k rows are fold 1, and so on
ncbirths <- ncbirths |> 
  mutate(fold_cut = cut(1:n, breaks = k, labels = FALSE))
  • More typical to assign groups randomly
# shuffle the repeated fold labels 1, ..., k so assignment is random
ncbirths <- ncbirths |> 
  mutate(fold_random = sample(rep_len(1:k, length.out = n),
                              size = n))
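A quick sanity check that the random folds are roughly equal in size:

ncbirths |> 
  count(fold_random)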

\(k\)-fold Cross Validation Process - details

Visualization of 5-fold CV by Joshua Ebner

\(k\)-fold Cross Validation Process - details

We implemented 5-fold CV for the model of birth weight on gestation weeks with the NC births data…
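One way that loop might look (a sketch building on the fold_random assignment from the earlier slide; the printed values below depend on the particular random split):

cv_r2 <- numeric(k)

for (i in 1:k) {
  # fit on the observations not in fold i
  fit <- lm(weight ~ weeks, data = filter(ncbirths, fold_random != i))

  # predict for the observations in fold i and save the R^2 there
  fold_i <- filter(ncbirths, fold_random == i)
  cv_r2[i] <- cor(fold_i$weight, predict(fit, newdata = fold_i),
                  use = "complete.obs")^2
}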

cv_r2
[1] 0.4677269 0.5130754 0.4223354 0.4381092 0.3779959
  4. Average the \(k\) performance metrics across folds
mean(cv_r2)
[1] 0.4438486

The \(R^2\) from fitting the model on the full dataset was 0.449, so it appears the model is neither overfitting nor underfitting.

Graphics with Geospatial Data (Maps)

We see geospatial graphics all the time

https://www.npr.org/sections/health-shots/2020/07/01/885263658/green-yellow-orange-or-red-this-new-tool-shows-covid-19-risk-in-your-county

Plotting geospatial data can uncover patterns that would be hard to determine through other analyses …

https://hpcf-files.umbc.edu/research/papers/REU2015Team2.pdf

… It can also help make the grouping of observations in your analysis clear!

https://pmc.ncbi.nlm.nih.gov/articles/PMC11180987/

ArcGIS doesn’t get to have all the fun

  • There are now many tools in R to plot geospatial data (minimal examples of both options follow this list)
  • maps / mapdata + geom_polygon()
    • pros: simplest way to map the US counties / states and world countries
    • cons: doesn’t include all geospatial boundaries you might want!
  • sf
    • pros: works with any common spatial object (like those used in ArcGIS), plus it’s well maintained and up to date!
    • cons: a bit more of a learning curve to use
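Minimal sketches of both approaches (the nc.shp demo shapefile ships with the sf package, and its AREA column is just an example fill variable):

library(ggplot2)

# maps + geom_polygon(): state outlines as a data frame of polygon vertices
states <- map_data("state")  # requires the maps package

ggplot(states, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "grey40") +
  coord_quickmap() +
  theme_void()

# sf + geom_sf(): reads any common spatial file (shapefile, GeoJSON, ...)
library(sf)

nc <- st_read(system.file("shape/nc.shp", package = "sf"))

ggplot(nc) +
  geom_sf(aes(fill = AREA)) +
  theme_void()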

PA 10: Map it!

You are implementing CV and animated plots in your project, so we’ll take this time to practice making nice maps with ggplot!

This image has nothing to do with this PA; it used to, and it is too fun to remove.

To do…

  • PA 10: Map it!
    • Due Thursday, 6/5 before class.
  • Final Project Report
    • Due Friday, 6/6 at 11:59pm.
  • Course Evaluation
    • Closes Friday, 6/6 at 11:59pm.