
Today we will…
Don’t write a whole function at once (if it is complicated)!
Set up intermediate tests
Print a lot
Just staring at your code probably won’t help
Recall from your statistics classes…
A random variable is a value we don’t know until we observe an instance.
The distribution of a random variable tells us its possible values and how likely they are.

Uniform Distribution

Normal Distribution

t-Distribution

Chi-Square Distribution

Binomial Distribution

r is for random sampling.
p is for probability.
x: \(P(X < x)\)q is for quantile.
q functions are “backwards” of the p functions.d is for density.
Probability of exactly 12 heads in 20 coin tosses, with a 70% chance of tails?
Empirical: the observed data.
We can generate fake (“synthetic”) data based on the assumption that a variable follows a certain distribution.
We randomly sample observations from the distribution.
set.seed()Since there is randomness involved, we will get a different result each time we run the code.
To generate a reproducible random sample, we first set the seed:
Whenever you are doing an analysis that involves a random element, you should set the seed!
| names | height | age | measure | supports_measure_A |
|---|---|---|---|---|
| Elbridge Kautzer | 67.43632 | 66.29460 | 1 | yes |
| Brandon King | 64.99480 | 61.53720 | 0 | no |
| Phyllis Thompson | 68.09035 | 53.83715 | 1 | yes |
| Humberto Corwin | 67.45541 | 33.87560 | 1 | yes |
| Theresia Koelpin | 71.37196 | 16.12199 | 1 | yes |
| Hayden O'Reilly-Johns | 66.17853 | 36.96293 | 0 | no |
Check to see the ages look uniformly distributed.
Use sample() to take a random sample of values from a vector.
[1] "horse" "cat" "chicken"
[1] "dog" "horse" "dog" "cat" "goat"
[1] 1 0 0 0 0 0 0 0 0 0
Warning
Whenever you take a sample, think about if you want to take a sample with or without replacement. The default is to sample without replacement.
Use slice_sample() to take a random sample of observations (rows) from a dataset.
| names | height | age | measure | supports_measure_A |
|---|---|---|---|---|
| Alexander Nicolas | 60.78593 | 25.87201 | 0 | no |
| Marnie Witting | 67.55575 | 48.26608 | 1 | yes |
| Liddie Wiza-Pouros | 66.36513 | 29.91378 | 1 | yes |
| names | height | age | measure | supports_measure_A |
|---|---|---|---|---|
| Debera Kirlin | 70.01628 | 20.18689 | 0 | no |
| Demario Muller | 69.03207 | 34.78672 | 1 | yes |
| Alvera Mayert | 66.06743 | 57.62611 | 0 | no |
| Dr. Duwayne Gleichner | 64.79083 | 31.31543 | 0 | no |
| Dr. Bethany Fisher | 71.70982 | 33.81118 | 1 | yes |
Suppose there is a group of 50 people.
Write a function to …
Use a map() function to repeat this simulation 1,000 times.
We can automatically include code output in the written portion of a Quarto document using `r `.
Type this: `r mean(sim_results)*100`% of the datasets contain at least two people with the same birthday.
To get this: 96.9% of the datasets contain at least two people with the same birthday.

Communicating about your analysis and findings is a key element of statistical computing.
Which is clearer to a general audience?:
In this analysis, we use data from the US Department of Labor which includes a variety of measurements of a state’s minimum wage for US states and territories by year. We additionally include information from a dataset provided by the Harvard Dataverse on state party leanings by year. Our analysis includes the years 1976 - 2020 and the 50 US states.
In this analysis, we use
inner_join()to joinus-party-data.csvandus-minimum-wage-data.csvbyyearandstate.
Stepping back…
sec.cols <- c("#fdb462", "#b3de69")
trip.cols <- c("#fb8072", "#80b1d3")
fish <- fish |>
mutate(trip = str_c("Trip ", trip)) |>
mutate(across(.cols = c(trip, section, species),
.fns = ~ as.factor(.x)))
fish |>
filter(if_any(.cols = everything(),
.fns = ~ is.na(.x))) |>
ggplot(aes(x = year,
fill = trip)) +
geom_bar() +
facet_grid(~ section) +
scale_fill_manual(values = trip.cols) +
labs(y = "",
subtitle = "Number of Missing Weight Values",
x = "Year",
fill = "Trip Number")
fish |>
filter(if_any(.cols = everything(),
.fns = ~ is.na(.x))) |>
ggplot(aes(x = year,
fill = trip)) +
geom_bar(position = "dodge") +
facet_grid(~ section) +
scale_fill_manual(values = trip.cols) +
labs(y = "",
subtitle = "Number of Missing Weight Values",
x = "Year",
fill = "Trip Number") +
theme(legend.position = "bottom")
fish |>
filter(if_any(.cols = everything(),
.fns = ~ is.na(.x))) |>
ggplot(aes(x = year,
fill = section)) +
geom_bar() +
facet_grid(~ trip) +
scale_fill_manual(values = sec.cols) +
labs(y = "",
subtitle = "Number of Missing Weight Values",
x = "Year",
fill = "Section")
fish |>
filter(if_any(.cols = everything(),
.fns = ~ is.na(.x))) |>
ggplot(aes(x = year,
fill = section)) +
geom_bar(position = "dodge") +
facet_grid(~ trip) +
scale_fill_manual(values = sec.cols) +
labs(y = "",
subtitle = "Number of Missing Weight Values",
x = "Year",
fill = "Section") +
theme(legend.position = "bottom")



fish_sum <- fish |>
group_by(year, section, trip) |>
summarize(n_miss = sum(is.na(weight)))
fish_sum |>
ggplot(aes(x = year,
y = n_miss,
color = trip)) +
geom_line(linewidth = 1) +
geom_point() +
facet_grid(cols = vars(section)) +
scale_color_manual(values = trip.cols) +
theme_bw() +
labs(y = "",
subtitle = "Number of Missing Weight Values",
x = "Year",
color = "Trip")
fish_sum |>
ggplot(aes(x = year,
y = n_miss,
color = section)) +
geom_line(linewidth = 1) +
geom_point() +
facet_grid(cols = vars(trip)) +
scale_color_manual(values = sec.cols) +
theme_bw() +
labs(y = "",
subtitle = "Number of Missing Weight Values",
x = "Year",
color = "Section")
fish_sum |>
ggplot(aes(x = year,
y = n_miss)) +
geom_line(linewidth = 1) +
geom_point() +
facet_grid(cols = vars(trip),
rows = vars(section)) +
theme_bw() +
labs(y = "",
subtitle = "Number of Missing Weight Values",
x = "Year")
fish_sum |>
ggplot(aes(x = year,
y = n_miss,
color = section,
linetype = trip)) +
geom_line(linewidth = 1) +
geom_point() +
scale_color_manual(values = sec.cols) +
scale_linetype_manual(values = c(2, 1)) +
theme_bw() +
labs(y = "",
subtitle = "Number of Missing Weight Values",
x = "Year",
color = "Section",
linetype = "Trip")



Rkable()kableExtra or gt!

kable() and kableExtra functionsgt package“Raw” Table:
kableExtratab_dat |>
arrange(desc(avg_weight)) |>
kable(digits = c(0, 1, 1, 0),
col.names = c("Species", "Mean", "SD", "N. Samples"),
caption = "Summaries of fish weights by species across all sampling years (between 1989 - 2006) trips and sites.") |>
kable_classic(full_width = F,
bootstrap_options = "striped") |>
add_header_above(c(" " = 1, "Weight (g)" = 2," " = 1),
bold = TRUE) |>
row_spec(row = 0, bold = T, align = "c")Weight (g) |
|||
|---|---|---|---|
| Species | Mean | SD | N. Samples |
| Bull | 598.4 | 635.4 | 553 |
| Brown | 425.9 | 381.8 | 3171 |
| WCT | 266.4 | 179.5 | 2287 |
| RBT | 183.2 | 182.3 | 12341 |
gttab_dat |>
arrange(desc(avg_weight)) |>
gt() |>
tab_options(table.font.size = 32) |>
tab_header(
title = "Summary of Fish Weights by Species",
subtitle = "all sampling years, trips, and sites"
) |>
tab_spanner(label = md("**Weight (g)**"),
columns = c(avg_weight, sd_weight)) |>
tab_style(style = cell_text(align = "center"),
locations = cells_column_labels()) |>
cols_align(align = "left",
columns = species) |>
fmt_number(columns = c(avg_weight, sd_weight),
decimals = 1) |>
fmt_number(columns = n,
decimals = 0) |>
cols_label(
"avg_weight" = md("**Mean**"),
"sd_weight" = md("**SD**"),
"n" = md("**N. Samples**"),
"species" = md("**Species**")
)| Summary of Fish Weights by Species | |||
| all sampling years, trips, and sites | |||
| Species |
Weight (g)
|
N. Samples | |
|---|---|---|---|
| Mean | SD | ||
| Bull | 598.4 | 635.4 | 553 |
| Brown | 425.9 | 381.8 | 3,171 |
| WCT | 266.4 | 179.5 | 2,287 |
| RBT | 183.2 | 182.3 | 12,341 |
Work with statistical distributions to determine if an instrument salesman is lying.

Revisit previous lab problems through the lens of efficiency
map() instead of across()See you in a week!
Enjoy the long weekend! A reminder that we do not have class on Tuesday 5/27.
Write a function to simulate height data from a population with some mean and SD height.
The user should be able to input:
Create a set of parameters (mean and SD) for each population.
Simulate datasets with different mean and SD heights.
crossing(mean_ht = seq(from = 60, to = 78, by = 6),
std_ht = c(3, 6)
) |>
mutate(ht_data = pmap(.l = list(avg = mean_ht, std = std_ht),
.f = sim_ht
)
)# A tibble: 8 × 3
mean_ht std_ht ht_data
<dbl> <dbl> <list>
1 60 3 <tibble [200 × 2]>
2 60 6 <tibble [200 × 2]>
3 66 3 <tibble [200 × 2]>
4 66 6 <tibble [200 × 2]>
5 72 3 <tibble [200 × 2]>
6 72 6 <tibble [200 × 2]>
7 78 3 <tibble [200 × 2]>
8 78 6 <tibble [200 × 2]>
Why am I getting a tibble in the ht_data column?
Extract the contents of each list!
# A tibble: 10 × 4
mean_ht std_ht person ht
<dbl> <dbl> <int> <dbl>
1 60 3 1 51.0
2 60 3 2 57.6
3 60 3 3 64.9
4 60 3 4 63.1
5 60 3 5 62.0
6 60 3 6 64.5
7 60 3 7 64.2
8 60 3 8 60.6
9 60 3 9 62.6
10 60 3 10 63.7
Why do I now have person and ht columns?
How many rows should I have for each mean_ht, std_ht combo?
nest() and unnest()map() family very nicely with two tidyr functions: nest() and unnest().nest() subsets of the data (as tibbles) inside a tibble.unnest() the data by row binding the subsets back together.Plot the samples simulated from each population.
fake_ht_data |>
mutate(across(.cols = mean_ht:std_ht,
.fns = ~as.character(.x)),
mean_ht = fct_recode(mean_ht,
`Mean = 60` = "60",
`Mean = 66` = "66",
`Mean = 72` = "72",
`Mean = 78` = "78"),
std_ht = fct_recode(std_ht,
`Std = 3` = "3",
`Std = 6` = "6")
) |>
ggplot(mapping = aes(x = ht)) +
geom_histogram(color = "white") +
facet_grid(std_ht ~ mean_ht) +
labs(x = "Height (in)",
y = "",
subtitle = "Frequency of Observations",
title = "Simulated Heights from Eight Different Populations")