
Today we will…
Recall from your statistics classes…
A random variable is a value we don’t know until we observe an instance.
The distribution of a random variable tells us its possible values and how likely they are.

Uniform Distribution

Normal Distribution

t-Distribution

Chi-Square Distribution

Binomial Distribution

min and max for the Uniform distribution
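r is for random sampling. Generate n random values from a distribution.
p is for probability. Compute the probability of a value less than x: \(P(X < x)\).
q is for quantile. Given a probability p, compute \(x\) such that \(P(X < x) = p\). The q functions are the “backwards” of the p functions.
d is for density. Compute the height of the density curve at x (continuous distributions) or the probability of exactly x (discrete distributions).
Probability of exactly 12 heads in 20 coin tosses, with a 70% chance of tails?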
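As a minimal sketch of the four prefixes, the base R calls below use the Uniform, Normal, and Binomial distributions. The specific inputs (1.96, 0, the seed) are illustrative assumptions, not values from the slides. Note that for discrete distributions the p functions return \(P(X \le x)\) by default.

```r
# r: draw 5 random values from a Uniform(0, 1) distribution
set.seed(42)  # arbitrary seed, chosen only for reproducibility
draws <- runif(5, min = 0, max = 1)

# p: probability that a standard Normal value falls below 1.96 (about 0.975)
p_below <- pnorm(1.96, mean = 0, sd = 1)

# q: the "backwards" step, recovering the cutoff from the probability
x_cut <- qnorm(p_below, mean = 0, sd = 1)

# d: height of the standard Normal density curve at 0
dens_at_zero <- dnorm(0, mean = 0, sd = 1)

# Coin toss question: a 70% chance of tails means P(heads) = 0.3,
# so we want P(X = 12) for X ~ Binomial(size = 20, prob = 0.3)
p_12_heads <- dbinom(12, size = 20, prob = 0.3)
p_12_heads
```

Because the Binomial is discrete, dbinom() returns an actual probability here rather than a curve height.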
Empirical: the observed data.
We can generate fake (“synthetic”) data based on the assumption that a variable follows a certain distribution.
We randomly sample observations from the distribution.
set.seed()
Since there is randomness involved, we will get a different result each time we run the code.
To generate a reproducible random sample, we first set the seed:
Whenever you are doing an analysis that involves a random element, you should set the seed!
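A small sketch of why the seed matters: resetting the same seed before each draw reproduces the same values. The seed value 1234 is an arbitrary assumption.

```r
# Set the seed, then draw three standard Normal values
set.seed(1234)
first_draw <- rnorm(3)

# Resetting the same seed replays the same "random" values
set.seed(1234)
second_draw <- rnorm(3)

identical(first_draw, second_draw)  # TRUE

# Without resetting the seed, the next draw differs
third_draw <- rnorm(3)
identical(first_draw, third_draw)   # FALSE
```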
| names | height | age | measure | supports_measure_A |
|---|---|---|---|---|
| Elbridge Kautzer | 67.43632 | 66.29460 | 1 | yes |
| Brandon King | 64.99480 | 61.53720 | 0 | no |
| Phyllis Thompson | 68.09035 | 53.83715 | 1 | yes |
| Humberto Corwin | 67.45541 | 33.87560 | 1 | yes |
| Theresia Koelpin | 71.37196 | 16.12199 | 1 | yes |
| Hayden O'Reilly-Johns | 66.17853 | 36.96293 | 0 | no |
Check to see that the ages look uniformly distributed.
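A dataset shaped like the one above could be generated along the following lines. This is only a sketch: the sample size, the distribution parameters, the seed, and the placeholder names are all assumptions, since the slides do not show the values actually used.

```r
library(tibble)

set.seed(2024)     # arbitrary seed
n_people <- 100    # hypothetical sample size

fake_data <- tibble(
  names   = paste("Person", seq_len(n_people)),        # placeholder names
  height  = rnorm(n_people, mean = 67, sd = 3),        # assumed Normal heights (in)
  age     = runif(n_people, min = 15, max = 75),       # assumed Uniform ages
  measure = rbinom(n_people, size = 1, prob = 0.6),    # assumed 60% support
  supports_measure_A = ifelse(measure == 1, "yes", "no")
)

# Rough check that ages look uniform: counts per 10-year bin should be similar.
# (A histogram, e.g. hist(fake_data$age), shows the same thing visually.)
age_counts <- table(cut(fake_data$age, breaks = seq(15, 75, by = 10)))
age_counts
```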
Use sample() to take a random sample of values from a vector.
[1] "horse" "cat" "chicken"
[1] "dog" "horse" "dog" "cat" "goat"
[1] 1 0 0 0 0 0 0 0 0 0
Warning
Whenever you take a sample, think about whether you want to sample with or without replacement. The default is to sample without replacement.
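The three kinds of output shown above could come from calls like the ones sketched here. The `animals` vector, the seed, and the probabilities are assumptions made for illustration.

```r
animals <- c("cat", "dog", "horse", "chicken", "goat")  # hypothetical vector

set.seed(10)  # arbitrary seed

# Without replacement (the default): no animal can appear twice
no_repl <- sample(animals, size = 3)

# With replacement: the same animal may be drawn more than once
with_repl <- sample(animals, size = 5, replace = TRUE)

# Unequal probabilities: 0 is drawn 80% of the time, 1 is drawn 20% of the time
zeros_ones <- sample(c(0, 1), size = 10, replace = TRUE, prob = c(0.8, 0.2))
```

Asking for more values than the vector contains, such as `sample(animals, size = 6)`, is an error unless `replace = TRUE`.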
Use slice_sample() to take a random sample of observations (rows) from a dataset.
| names | height | age | measure | supports_measure_A |
|---|---|---|---|---|
| Alexander Nicolas | 60.78593 | 25.87201 | 0 | no |
| Marnie Witting | 67.55575 | 48.26608 | 1 | yes |
| Liddie Wiza-Pouros | 66.36513 | 29.91378 | 1 | yes |

| names | height | age | measure | supports_measure_A |
|---|---|---|---|---|
| Debera Kirlin | 70.01628 | 20.18689 | 0 | no |
| Demario Muller | 69.03207 | 34.78672 | 1 | yes |
| Alvera Mayert | 66.06743 | 57.62611 | 0 | no |
| Dr. Duwayne Gleichner | 64.79083 | 31.31543 | 0 | no |
| Dr. Bethany Fisher | 71.70982 | 33.81118 | 1 | yes |
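A sketch of sampling rows with dplyr's slice_sample(), either by count (`n`) or by proportion (`prop`). The `people` tibble and its values are hypothetical stand-ins for the dataset used in the slides.

```r
library(dplyr)
library(tibble)

set.seed(5)  # arbitrary seed

# Hypothetical dataset with 20 rows
people <- tibble(
  id     = 1:20,
  height = rnorm(20, mean = 67, sd = 3)
)

# Sample a fixed number of rows
three_rows <- slice_sample(people, n = 3)

# Sample a proportion of rows: 25% of 20 rows is 5 rows
quarter_rows <- slice_sample(people, prop = 0.25)
```

Like sample(), slice_sample() samples without replacement unless `replace = TRUE` is given.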
Suppose there is a group of 50 people.
Write a function to …
Simulate the birthdays of n random people (assuming it is equally likely to be born any day of the year).
Use a map() function to repeat this simulation 1,000 times.
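One possible version of this simulation is sketched below. The function name `has_shared_birthday` and the seed are assumptions, and leap days are ignored (365 equally likely days).

```r
library(purrr)

# Hypothetical helper: TRUE if at least two of n people share a birthday
has_shared_birthday <- function(n = 50) {
  # Birthdays are days 1 to 365, sampled with replacement
  birthdays <- sample(1:365, size = n, replace = TRUE)
  any(duplicated(birthdays))
}

set.seed(1)  # arbitrary seed

# Repeat the simulation 1,000 times, collecting one TRUE/FALSE per run
sim_results <- map_lgl(1:1000, ~ has_shared_birthday(n = 50))

# Proportion of runs with at least one shared birthday
mean(sim_results)
```

map_lgl() is used instead of map() so the results come back as a logical vector, which makes mean() give the proportion directly.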
We can automatically include code output in the written portion of a Quarto document using `r `.
Type this: `r mean(sim_results)*100`% of the iterations contain at least two people with the same birthday.
To get this: 96.9% of the iterations contain at least two people with the same birthday.
Write a function to simulate height data from a population with some mean and SD height.
The user should be able to input:
Create a set of parameters (mean and SD) for each population.
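A sketch of both steps, under some assumptions: the function name `sim_ht` and its argument names are made up here, while the parameter values (means 60, 66, 72, 78; SDs 3, 6; 200 people per sample) are chosen to match the output shown later in the slides.

```r
library(tibble)
library(tidyr)

# Hypothetical function: simulate n heights from a Normal population
sim_ht <- function(n = 200, mean_ht, std_ht) {
  tibble(
    person = 1:n,
    ht     = rnorm(n, mean = mean_ht, sd = std_ht)
  )
}

# Every combination of mean and SD: 4 means x 2 SDs = 8 populations
params <- crossing(
  mean_ht = seq(60, 78, by = 6),
  std_ht  = c(3, 6)
)

# Try the function on a single population
set.seed(93401)  # arbitrary seed
one_sample <- sim_ht(n = 200, mean_ht = 66, std_ht = 3)
```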
Goal: Simulate datasets with all of these different parameters.
In other words, we want to iterate across all of these parameter combinations.
Sounds like a job for purrr!
pmap() Family
These functions take in a list of vectors and a function.
The vectors need to have the same names as the arguments of the function you are applying.
string pattern replacement
1 apple p P
2 banana n N
3 cherry h H
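The column names in the table above match the arguments of stringr's str_replace_all(), so a pmap() call can walk through the rows in parallel. This is a sketch of one plausible use. The object name `fruit` is an assumption, and the slides may have applied a different function.

```r
library(purrr)
library(stringr)
library(tibble)

# Column names match the arguments of str_replace_all()
fruit <- tibble(
  string      = c("apple", "banana", "cherry"),
  pattern     = c("p", "n", "h"),
  replacement = c("P", "N", "H")
)

# Row by row: str_replace_all(string = ..., pattern = ..., replacement = ...)
replaced <- pmap_chr(fruit, str_replace_all)
replaced
# "aPPle"  "baNaNa" "cHerry"
```

A tibble is a list of columns, so it can be passed to pmap() directly as the list of vectors.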
Simulate datasets with different mean and SD heights.
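A sketch of how pmap() can iterate over the parameter combinations. The function `sim_ht` and the seed are assumptions. The list passed to pmap() uses the same names as the function's arguments.

```r
library(dplyr)
library(purrr)
library(tibble)
library(tidyr)

# Hypothetical simulation function (200 people per sample by default)
sim_ht <- function(n = 200, mean_ht, std_ht) {
  tibble(person = 1:n,
         ht     = rnorm(n, mean = mean_ht, sd = std_ht))
}

set.seed(435)  # arbitrary seed

fake_ht_data <- crossing(mean_ht = seq(60, 78, by = 6),
                         std_ht  = c(3, 6)) |>
  # One simulated dataset per row, stored in a list column
  mutate(ht_data = pmap(.l = list(mean_ht = mean_ht, std_ht = std_ht),
                        .f = sim_ht))

fake_ht_data
```

Each call to sim_ht() returns a tibble, so the new column is a list of tibbles, one per parameter combination.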
# A tibble: 8 × 3
mean_ht std_ht ht_data
<dbl> <dbl> <list>
1 60 3 <tibble [200 × 2]>
2 60 6 <tibble [200 × 2]>
3 66 3 <tibble [200 × 2]>
4 66 6 <tibble [200 × 2]>
5 72 3 <tibble [200 × 2]>
6 72 6 <tibble [200 × 2]>
7 78 3 <tibble [200 × 2]>
8 78 6 <tibble [200 × 2]>
Why am I getting a tibble in the ht_data column?
Extract the contents of each list!
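To see what extracting does, here is a deliberately tiny sketch: two populations with three people each, built by hand. The values and seed are illustrative assumptions. unnest() row binds the inner tibbles, and nest() reverses the step.

```r
library(tibble)
library(tidyr)

set.seed(7)  # arbitrary seed

# A small nested tibble: one row per population, data in a list column
nested <- tibble(
  mean_ht = c(60, 66),
  std_ht  = c(3, 3),
  ht_data = list(
    tibble(person = 1:3, ht = rnorm(3, mean = 60, sd = 3)),
    tibble(person = 1:3, ht = rnorm(3, mean = 66, sd = 3))
  )
)

# unnest(): 2 populations x 3 people = 6 rows; the inner columns
# (person, ht) become regular columns next to mean_ht and std_ht
unnested <- unnest(nested, cols = ht_data)

# nest(): pack person and ht back into a list column, one row per population
renested <- nest(unnested, ht_data = c(person, ht))
```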
# A tibble: 1,600 × 4
mean_ht std_ht person ht
<dbl> <dbl> <int> <dbl>
1 60 3 1 65.3
2 60 3 2 59.4
3 60 3 3 61.0
4 60 3 4 64.3
5 60 3 5 62.1
6 60 3 6 57.9
7 60 3 7 59.4
8 60 3 8 65.6
9 60 3 9 57.1
10 60 3 10 63.0
# ℹ 1,590 more rows
Why do I now have person and ht columns?
How many rows should I have for each mean_ht, std_ht combo?
nest() and unnest()
We can pair functions from the map() family very nicely with two tidyr functions: nest() and unnest().
We can nest() subsets of the data (as tibbles) inside a tibble.
We can then unnest() the data by row binding the subsets back together.
Plot the samples simulated from each population.
fake_ht_data |>
  mutate(across(.cols = mean_ht:std_ht,
                .fns = ~ as.character(.x)),
         mean_ht = fct_recode(mean_ht,
                              `Mean = 60` = "60",
                              `Mean = 66` = "66",
                              `Mean = 72` = "72",
                              `Mean = 78` = "78"),
         std_ht = fct_recode(std_ht,
                             `Std = 3` = "3",
                             `Std = 6` = "6")
         ) |>
  ggplot(mapping = aes(x = ht)) +
  geom_histogram(color = "white") +
  facet_grid(std_ht ~ mean_ht) +
  labs(x = "Height (in)",
       y = "",
       subtitle = "Frequency of Observations",
       title = "Simulated Heights from Eight Different Populations")

Communicating about your analysis and findings is a key element of statistical computing.
Which is clearer to a general audience?
In this analysis, we use data from the US Department of Labor which includes a variety of measurements of a state’s minimum wage for US states and territories by year. We additionally include information from a dataset provided by the Harvard Dataverse on state party leanings by year. Our analysis includes the years 1976 - 2020 and the 50 US states.
In this analysis, we use inner_join() to join us-party-data.csv and us-minimum-wage-data.csv by year and state.
Work with statistical distributions and iterating random processes to determine if an instrument salesman is lying.


See you in a week!
Enjoy the long weekend! A reminder that we do not have class on Monday 5/25.