
Today we will…
Don’t write a whole function at once (if it is complicated)!
Set up intermediate tests
Print a lot
Just staring at your code probably won’t help
Recall from your statistics classes…
A random variable is a value we don’t know until we observe an instance.
The distribution of a random variable tells us its possible values and how likely they are.

Uniform Distribution

Normal Distribution

t-Distribution

Chi-Square Distribution

Binomial Distribution

r is for random sampling.
p is for probability.
x: \(P(X < x)\)q is for quantile.
q functions are “backwards” of the p functions.d is for density.
Probability of exactly 12 heads in 20 coin tosses, with a 70% chance of tails?
Empirical: the observed data.
We can generate fake (“synthetic”) data based on the assumption that a variable follows a certain distribution.
We randomly sample observations from the distribution.
set.seed()Since there is randomness involved, we will get a different result each time we run the code.
To generate a reproducible random sample, we first set the seed:
Whenever you are doing an analysis that involves a random element, you should set the seed!
| names | height | age | measure | supports_measure_A | 
|---|---|---|---|---|
| Elbridge Kautzer | 67.43632 | 66.29460 | 1 | yes | 
| Brandon King | 64.99480 | 61.53720 | 0 | no | 
| Phyllis Thompson | 68.09035 | 53.83715 | 1 | yes | 
| Humberto Corwin | 67.45541 | 33.87560 | 1 | yes | 
| Theresia Koelpin | 71.37196 | 16.12199 | 1 | yes | 
| Hayden O'Reilly-Johns | 66.17853 | 36.96293 | 0 | no | 
Check to see the ages look uniformly distributed.
Use sample() to take a random sample of values from a vector.
[1] "horse"   "cat"     "chicken"[1] "dog"   "horse" "dog"   "cat"   "goat"  [1] 1 0 0 0 0 0 0 0 0 0Warning
Whenever you take a sample, think about if you want to take a sample with or without replacement. The default is to sample without replacement.
Use slice_sample() to take a random sample of observations (rows) from a dataset.
| names | height | age | measure | supports_measure_A | 
|---|---|---|---|---|
| Alexander Nicolas | 60.78593 | 25.87201 | 0 | no | 
| Marnie Witting | 67.55575 | 48.26608 | 1 | yes | 
| Liddie Wiza-Pouros | 66.36513 | 29.91378 | 1 | yes | 
| names | height | age | measure | supports_measure_A | 
|---|---|---|---|---|
| Debera Kirlin | 70.01628 | 20.18689 | 0 | no | 
| Demario Muller | 69.03207 | 34.78672 | 1 | yes | 
| Alvera Mayert | 66.06743 | 57.62611 | 0 | no | 
| Dr. Duwayne Gleichner | 64.79083 | 31.31543 | 0 | no | 
| Dr. Bethany Fisher | 71.70982 | 33.81118 | 1 | yes | 
Suppose there is a group of 50 people.
Write a function to …
Use a map() function to repeat this simulation 1,000 times.
We can automatically include code output in the written portion of a Quarto document using `r `.
Type this: `r mean(sim_results)*100`% of the datasets contain at least two people with the same birthday.
To get this: 96.9% of the datasets contain at least two people with the same birthday.

Communicating about your analysis and findings is a key element of statistical computing.
Which is clearer to a general audience?:
In this analysis, we use data from the US Department of Labor which includes a variety of measurements of a state’s minimum wage for US states and territories by year. We additionally include information from a dataset provided by the Harvard Dataverse on state party leanings by year. Our analysis includes the years 1976 - 2020 and the 50 US states.
In this analysis, we use
inner_join()to joinus-party-data.csvandus-minimum-wage-data.csvbyyearandstate.
Stepping back…
sec.cols <- c("#fdb462", "#b3de69")
trip.cols <- c("#fb8072", "#80b1d3")
fish <- fish |>
  mutate(trip = str_c("Trip ", trip)) |>
  mutate(across(.cols = c(trip, section, species),
                .fns = ~ as.factor(.x)))
fish |>
  filter(if_any(.cols = everything(),
                .fns = ~ is.na(.x))) |>
  ggplot(aes(x = year,
             fill = trip)) +
  geom_bar() +
  facet_grid(~ section) +
  scale_fill_manual(values = trip.cols) +
  labs(y = "",
       subtitle = "Number of Missing Weight Values",
       x = "Year",
       fill = "Trip Number")
fish |>
  filter(if_any(.cols = everything(),
                .fns = ~ is.na(.x))) |>
  ggplot(aes(x = year,
             fill = trip)) +
  geom_bar(position = "dodge") +
  facet_grid(~ section) +
  scale_fill_manual(values = trip.cols) +
  labs(y = "",
       subtitle = "Number of Missing Weight Values",
       x = "Year",
       fill = "Trip Number") +
  theme(legend.position = "bottom")
fish |>
  filter(if_any(.cols = everything(),
                .fns = ~ is.na(.x))) |>
  ggplot(aes(x = year,
             fill = section)) +
  geom_bar() +
  facet_grid(~ trip) +
  scale_fill_manual(values =  sec.cols) +
  labs(y = "",
       subtitle = "Number of Missing Weight Values",
       x = "Year",
       fill = "Section")
fish |>
  filter(if_any(.cols = everything(),
                .fns = ~ is.na(.x))) |>
  ggplot(aes(x = year,
             fill = section)) +
  geom_bar(position = "dodge") +
  facet_grid(~ trip) +
  scale_fill_manual(values = sec.cols) +
  labs(y = "",
       subtitle = "Number of Missing Weight Values",
       x = "Year",
       fill = "Section") +
  theme(legend.position = "bottom")



fish_sum <- fish |>
  group_by(year, section, trip) |>
  summarize(n_miss = sum(is.na(weight)))
fish_sum |>
  ggplot(aes(x = year,
             y = n_miss,
             color = trip)) +
  geom_line(linewidth = 1) +
  geom_point() +
  facet_grid(cols = vars(section)) +
  scale_color_manual(values = trip.cols) +
  theme_bw() +
  labs(y = "",
       subtitle = "Number of Missing Weight Values",
       x = "Year",
       color = "Trip")
fish_sum |>
  ggplot(aes(x = year,
             y = n_miss,
             color = section)) +
  geom_line(linewidth = 1) +
  geom_point() +
  facet_grid(cols = vars(trip)) +
  scale_color_manual(values = sec.cols) +
  theme_bw() +
  labs(y = "",
       subtitle = "Number of Missing Weight Values",
       x = "Year",
       color = "Section")
fish_sum |>
  ggplot(aes(x = year,
             y = n_miss)) +
  geom_line(linewidth = 1) +
  geom_point() +
  facet_grid(cols = vars(trip),
             rows = vars(section)) +
  theme_bw() +
  labs(y = "",
       subtitle = "Number of Missing Weight Values",
       x = "Year")
fish_sum |>
  ggplot(aes(x = year,
             y = n_miss,
             color = section,
             linetype = trip)) +
  geom_line(linewidth = 1) +
  geom_point() +
  scale_color_manual(values = sec.cols) +
  scale_linetype_manual(values = c(2, 1)) +
  theme_bw() +
  labs(y = "",
       subtitle = "Number of Missing Weight Values",
       x = "Year",
       color = "Section",
       linetype = "Trip")



Rkable()kableExtra or gt!

kable() and kableExtra functionsgt package“Raw” Table:
kableExtratab_dat |> 
  arrange(desc(avg_weight)) |> 
  kable(digits = c(0, 1, 1, 0),
        col.names = c("Species", "Mean", "SD", "N. Samples"),
        caption = "Summaries of fish weights by species across all sampling years (between 1989 - 2006) trips and sites.") |>
  kable_classic(full_width = F,
                bootstrap_options = "striped") |> 
  add_header_above(c(" " = 1, "Weight (g)" = 2," " = 1),
                   bold = TRUE) |> 
  row_spec(row = 0, bold = T, align = "c")| Weight (g) | |||
|---|---|---|---|
| Species | Mean | SD | N. Samples | 
| Bull | 598.4 | 635.4 | 553 | 
| Brown | 425.9 | 381.8 | 3171 | 
| WCT | 266.4 | 179.5 | 2287 | 
| RBT | 183.2 | 182.3 | 12341 | 
gttab_dat |> 
  arrange(desc(avg_weight)) |> 
  gt() |> 
  tab_options(table.font.size = 32) |> 
  tab_header(
    title = "Summary of Fish Weights by Species",
    subtitle = "all sampling years, trips, and sites"
  ) |> 
  tab_spanner(label = md("**Weight (g)**"), 
              columns = c(avg_weight, sd_weight)) |> 
  tab_style(style = cell_text(align = "center"),
    locations = cells_column_labels()) |> 
  cols_align(align = "left",
             columns = species) |> 
  fmt_number(columns = c(avg_weight, sd_weight),
             decimals = 1) |> 
  fmt_number(columns = n,
             decimals = 0) |> 
  cols_label(
    "avg_weight" = md("**Mean**"),
    "sd_weight" = md("**SD**"),
    "n" = md("**N. Samples**"),
    "species" = md("**Species**")
  )| Summary of Fish Weights by Species | |||
| all sampling years, trips, and sites | |||
| Species | Weight (g) | N. Samples | |
|---|---|---|---|
| Mean | SD | ||
| Bull | 598.4 | 635.4 | 553 | 
| Brown | 425.9 | 381.8 | 3,171 | 
| WCT | 266.4 | 179.5 | 2,287 | 
| RBT | 183.2 | 182.3 | 12,341 | 
Work with statistical distributions to determine if an instrument salesman is lying.

Revisit previous lab problems through the lens of efficiency
map() instead of across()See you in a week!
Enjoy the long weekend! A reminder that we do not have class on Tuesday 5/27.
Write a function to simulate height data from a population with some mean and SD height.
The user should be able to input:
Create a set of parameters (mean and SD) for each population.
Simulate datasets with different mean and SD heights.
crossing(mean_ht = seq(from = 60, to = 78, by = 6),
         std_ht  = c(3, 6)
         ) |> 
 mutate(ht_data = pmap(.l = list(avg = mean_ht, std = std_ht), 
                       .f = sim_ht
                       )
        )# A tibble: 8 × 3
  mean_ht std_ht ht_data           
    <dbl>  <dbl> <list>            
1      60      3 <tibble [200 × 2]>
2      60      6 <tibble [200 × 2]>
3      66      3 <tibble [200 × 2]>
4      66      6 <tibble [200 × 2]>
5      72      3 <tibble [200 × 2]>
6      72      6 <tibble [200 × 2]>
7      78      3 <tibble [200 × 2]>
8      78      6 <tibble [200 × 2]>Why am I getting a tibble in the ht_data column?
Extract the contents of each list!
# A tibble: 10 × 4
   mean_ht std_ht person    ht
     <dbl>  <dbl>  <int> <dbl>
 1      60      3      1  51.0
 2      60      3      2  57.6
 3      60      3      3  64.9
 4      60      3      4  63.1
 5      60      3      5  62.0
 6      60      3      6  64.5
 7      60      3      7  64.2
 8      60      3      8  60.6
 9      60      3      9  62.6
10      60      3     10  63.7Why do I now have person and ht columns?
How many rows should I have for each mean_ht, std_ht combo?
nest() and unnest()map() family very nicely with two tidyr functions: nest() and unnest().nest() subsets of the data (as tibbles) inside a tibble.unnest() the data by row binding the subsets back together.Plot the samples simulated from each population.
fake_ht_data |> 
  mutate(across(.cols = mean_ht:std_ht, 
                .fns = ~as.character(.x)), 
         mean_ht = fct_recode(mean_ht, 
                              `Mean = 60` = "60", 
                              `Mean = 66` = "66", 
                              `Mean = 72` = "72", 
                              `Mean = 78` = "78"), 
         std_ht = fct_recode(std_ht, 
                             `Std = 3` = "3", 
                             `Std = 6` = "6")
         ) |> 
  ggplot(mapping = aes(x = ht)) +
  geom_histogram(color = "white") +
  facet_grid(std_ht ~ mean_ht) +
  labs(x = "Height (in)",
       y = "",
       subtitle = "Frequency of Observations",
       title = "Simulated Heights from Eight Different Populations")