Today we will…
Even if a model is “good” according to our metric with the data we have, how do we know if it will still work well with other data?
Big idea: Reserve part of your data for testing to get an idea of how the model would do on outside data
If the model is overfit…
If the model is underfit …
If the model is neither over nor underfit …
Choose a value for \(k\)
Split data into \(k\) folds
For fold \(i\) from 1 … \(k\):
Average the \(k\) performance metrics across folds
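The first two steps above (choosing \(k\) and splitting the data into folds) can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not necessarily how the folds were built in class: the built-in `mtcars` data stands in for the course data, and the column name `fold_random` is borrowed from the loop shown later in these slides.

```r
set.seed(1)      # make the random assignment reproducible
k <- 5           # step 0: choose the number of folds
dat <- mtcars    # stand-in data set (assumption, 32 rows)

# step 1: repeat the labels 1..k until there is one per row, then shuffle them
fold_labels <- rep(1:k, length.out = nrow(dat))
dat$fold_random <- sample(fold_labels)

# every fold ends up with roughly nrow(dat) / k rows
table(dat$fold_random)
```

Shuffling a balanced vector of labels keeps the fold sizes within one row of each other, which drawing each row's label independently at random would not guarantee.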
0. Choose a value for \(k\)
We implemented 5-fold CV for the model of birth weight on gestation weeks with the NC births data…
The \(R^2\) from fitting the model on the full dataset was 0.449, so it appears the model is neither overfitting nor underfitting (maybe slightly overfitting).
# assumes k, the fold_random column of ncbirths, and r2_calc() were defined earlier,
# and that dplyr is loaded for filter()

# storage for the k held-out R2 values
cv_r2 <- rep(NA, k)

# for each fold 1-k...
for (x in 1:k) {
  # separate fold (test) data
  fold_dat <- ncbirths |>
    filter(fold_random == x)

  # and training data
  train_dat <- ncbirths |>
    filter(fold_random != x)

  # fit model with training data
  it_lm <- lm(weight ~ weeks,
              data = train_dat)

  # generate predictions for the held-out fold data
  fold_preds <- predict(it_lm, newdata = fold_dat)

  # calculate R2 for the held-out fold data and save it
  cv_r2[x] <- r2_calc(obs = fold_dat$weight,
                      pred = fold_preds)
}

Important
You may need this function on the final exam…
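The cross-validation loop above calls `r2_calc()`, whose definition does not appear in this section. A minimal sketch of one common definition, \(R^2 = 1 - SS_{res} / SS_{tot}\), is given below. The body of the function is an assumption: the version used in class may differ (for example, it could use the squared correlation between observed and predicted values).

```r
# hypothetical helper: R^2 as 1 - (residual sum of squares / total sum of squares)
r2_calc <- function(obs, pred) {
  ss_res <- sum((obs - pred)^2)        # squared error of the predictions
  ss_tot <- sum((obs - mean(obs))^2)   # squared error of predicting the mean
  1 - ss_res / ss_tot
}

# sanity checks on a toy vector
obs <- c(1, 2, 3, 4)
r2_calc(obs, obs)                  # perfect predictions
r2_calc(obs, rep(mean(obs), 4))    # always predicting the mean
```

With this definition, a held-out \(R^2\) can be negative when the model predicts worse than the fold's own mean, which is itself a sign of overfitting.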
Plotting geospatial data can uncover patterns that would be hard to determine through other analyses …
… It can also help make groupings of observations in your analysis clear!
Plotting geospatial data in R:
maps / mapdata + geom_polygon()
sf
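As a starting point for the first option, here is a minimal sketch that draws state outlines with `geom_polygon()`. It assumes the `ggplot2` and `maps` packages are installed. The choice of the `"state"` map and the styling are illustrative, not taken from the course materials.

```r
library(ggplot2)   # map_data() also needs the maps package installed

# one row per polygon vertex, with long, lat, group, and region columns
states <- map_data("state")

p <- ggplot(states, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey90", color = "white") +  # one polygon per group
  coord_quickmap() +                                # keep a sensible aspect ratio
  theme_void()                                      # drop axes and gridlines

p
```

The `group` aesthetic matters here: without it, `geom_polygon()` would connect the vertices of different states into one tangled shape.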
The best way to learn how to make nice maps with ggplot is to just jump in yourself!