Model Validation + Geospatial Graphics

Tuesday, June 3

Today we will…

  • Linear Regression: Feedback
  • New Material:
    • Model Validation
    • Graphics with Geospatial Data (Maps)
  • Work Time
    • PA 10: Map it!
    • PC5: Final Report

Linear Regression: Project Feedback

  • They are coming along very nicely!

  • Final submission should be a polished report.

    • Integrate writing and analysis.
  • Think about the readability of the numbers you are presenting.

    • Do you need 6 decimal places?
    • Is scientific notation easily understood?
  • Include units on your plots, including on any transformed variables.

  • Don’t display the raw R lm() output; present a cleaned-up table instead (one option is sketched below).
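For example, a minimal sketch using the broom package (an assumed choice; any cleaned-up table works), with the birth weight model from later in these slides as the example:

library(broom)

fit <- lm(weight ~ weeks, data = ncbirths)
tidy(fit)    # coefficient table as a data frame you can round and format
glance(fit)  # one-row model summary (R^2, sigma, etc.)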

Model Validation

How do we tell if a model is “good”?

  • There are lots of different metrics to measure model performance!
    • Your choice depends on the type of model and what is “good” for the context
    • Do you only care about predictions? Or more about inference?
  • See week 9 slides
  • RMSE
  • \(R^2\)
  • overall prediction success or failure rate
  • sensitivity: true positive rate
  • specificity: true negative rate
  • many more… (the first two are sketched in code below)
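For reference, a minimal sketch of the first two in R; actual and predicted are illustrative names for a vector of observed responses and the corresponding model predictions:

# RMSE: typical size of a prediction error, in the units of the response
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

# R^2 as the squared correlation between observed and predicted values
r_squared <- function(actual, predicted) {
  cor(actual, predicted, use = "complete.obs")^2
}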

Overfitting and Underfitting

Big Question for Model Validation

Even if a model is “good” according to our metric with the data we have, how do we know if it will still work well with other data?

Train / Test Split

  • Big idea: Reserve part of your data for testing to get an idea of how the model would do on outside data (a short R sketch follows these steps)
    1. Split your data into a training and a testing set (typically 80%/20%)
    2. Set the testing set aside
    3. Do all the model development you want with the training data
    4. Fit a final model with the training data
    5. Generate predictions from that model with the testing data and calculate your performance metric
    6. Compare the testing and training metrics
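A minimal sketch of these steps in R, assuming the openintro ncbirths data and the birth weight on gestation weeks model used later in these slides:

library(dplyr)
library(openintro)  # assumed source of the ncbirths data

set.seed(42)

# steps 1-2: 80% / 20% split
n <- nrow(ncbirths)
train_rows <- sample(1:n, size = floor(0.8 * n))
train <- ncbirths[train_rows, ]
test  <- ncbirths[-train_rows, ]

# steps 3-4: develop and fit the final model with the training data only
fit <- lm(weight ~ weeks, data = train)

# step 5: generate predictions for the testing data and compute the metric
test_r2 <- cor(test$weight, predict(fit, newdata = test),
               use = "complete.obs")^2

# step 6: compare testing and training metrics
c(train = summary(fit)$r.squared, test = test_r2)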

Comparing Train / Test Performance

  • If the model is overfit…

    • test performance < training performance
  • If the model is underfit …

    • test performance > training performance
  • If the model is neither over nor underfit …

    • test performance \(\approx\) training performance

\(k\)-fold Cross Validation

  • Iterates over \(k\) train / test splits
  • Uses all of the data
  • Especially useful for model development
    • covariate selection
    • specifying parameters for ML models

\(k\)-fold Cross Validation Process

  1. Choose a value for \(k\)

  2. Split data into \(k\) folds

  3. For fold \(i\) from 1 … \(k\):

    1. Fit model on observations not in fold \(i\)
    2. Generate predictions from this model for the observations in fold \(i\)
    3. Calculate and save performance metric of interest
  4. Average the \(k\) performance metrics across folds

\(k\)-fold Cross Validation Process - details

  1. Choose a value for \(k\)

  • Typical values are 5 or 10
k <- 5

\(k\)-fold Cross Validation Process - details

  2. Split data into \(k\) folds
  • Should be approximately the same size
  • You could just split the data into \(k\) groups based on their row number
n <- nrow(ncbirths) 

# assign folds by row position: the first n/k rows are fold 1, and so on
ncbirths <- ncbirths |> 
  mutate(fold_cut = cut(1:n, breaks = k, labels = FALSE))
  • More typical to assign groups randomly
# shuffle the repeated fold labels 1, ..., k so assignment is random
ncbirths <- ncbirths |> 
  mutate(fold_random = sample(rep_len(1:k, length.out = n),
                              size = n))
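A quick sanity check that the random folds are roughly equal in size:

ncbirths |> 
  count(fold_random)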

\(k\)-fold Cross Validation Process - details

Visualization of 5-fold CV by Joshua Ebner

\(k\)-fold Cross Validation Process - details

We implemented 5-fold CV for the model of birth weight on gestation weeks with the NC births data…
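One way that loop might look (a sketch building on the fold_random assignment from the earlier slide; the printed values below depend on the particular random split):

cv_r2 <- numeric(k)

for (i in 1:k) {
  # fit on the observations not in fold i
  fit <- lm(weight ~ weeks, data = filter(ncbirths, fold_random != i))

  # predict for the observations in fold i and save the R^2 there
  fold_i <- filter(ncbirths, fold_random == i)
  cv_r2[i] <- cor(fold_i$weight, predict(fit, newdata = fold_i),
                  use = "complete.obs")^2
}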

cv_r2
[1] 0.4677269 0.5130754 0.4223354 0.4381092 0.3779959
  4. Average the \(k\) performance metrics across folds
mean(cv_r2)
[1] 0.4438486

The \(R^2\) from fitting the model on the full dataset was 0.449, so it appears the model is neither overfitting nor underfitting.

Graphics with Geospatial Data (Maps)

We see geospatial graphics all the time

https://www.npr.org/sections/health-shots/2020/07/01/885263658/green-yellow-orange-or-red-this-new-tool-shows-covid-19-risk-in-your-county

Plotting geospatial data can uncover patterns that would be hard to determine through other analyses …

https://hpcf-files.umbc.edu/research/papers/REU2015Team2.pdf

… It can also help make the grouping of observations in your analysis clear!

https://pmc.ncbi.nlm.nih.gov/articles/PMC11180987/

ArcGIS doesn’t get to have all the fun

  • There are now many tools in R to plot geospatial data (minimal examples of both options follow this list)
  • maps / mapdata + geom_polygon()
    • pros: simplest way to map the US counties / states and world countries
    • cons: doesn’t include all geospatial boundaries you might want!
  • sf
    • pros: works with any common spatial object (like those used in ArcGIS), plus it’s well maintained and up to date!
    • cons: a bit more of a learning curve to use
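Minimal sketches of both approaches (the nc.shp demo shapefile ships with the sf package, and its AREA column is just an example fill variable):

library(ggplot2)

# maps + geom_polygon(): state outlines as a data frame of polygon vertices
states <- map_data("state")  # requires the maps package

ggplot(states, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "grey40") +
  coord_quickmap() +
  theme_void()

# sf + geom_sf(): reads any common spatial file (shapefile, GeoJSON, ...)
library(sf)

nc <- st_read(system.file("shape/nc.shp", package = "sf"))

ggplot(nc) +
  geom_sf(aes(fill = AREA)) +
  theme_void()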

PA 10: Map it!

You are implementing CV and animated plots in your project, so we’ll take this time to practice making nice maps with ggplot!

This image has nothing to do with this PA; it used to, and it is too fun to remove.

To do…

  • PA 10: Map it!
    • Due Thursday, 6/5 before class.
  • Final Project Report
    • Due Friday, 6/6 at 11:59pm.
  • Course Evaluation
    • Closes Friday, 6/6 at 11:59pm.