directors_genres
Today we will…
dplyr
tidyr
Follow along
Remember to download, save, and open up the starter notes for this week! This week you need to download and save some data as well.
Adapted from r4ds
🧐 need to get this #%$ together before we can analyze it
Discussion
What if we want to know which actor has worked with the most directors in the dataset?
What analytical dataset would we need to answer this question? What are the rows, and variables needed?
We can combine (join) data tables based on their relations.
Mutating joins
Add variables from a new dataframe to observations in an existing dataframe.
full_join(), left_join(), right_join(), inner_join()
Filtering joins
Filter observations based on values in a new dataframe.
semi_join(), anti_join()
A key is some combination of variables that (should) uniquely identify an observation in a data set.
inner_join()
Keeps observations when their keys are present in both datasets.
Discussion
When would you want to use inner_join()?
inner_join(): IMDb Example
director_id | genre | prob |
---|---|---|
429 | Adventure | 0.750000 |
429 | Fantasy | 0.750000 |
2931 | Drama | 0.714286 |
2931 | Action | 0.428571 |
11652 | Sci-Fi | 0.500000 |
11652 | Action | 0.500000 |
14927 | Animation | 1.000000 |
14927 | Family | 1.000000 |
15092 | Comedy | 0.545455 |
15092 | Crime | 0.545455 |
directors_genres IDs: 429, 2931, 11652, 14927, 15092
movies_directors IDs: 429, 9247, 11652, 14927, 15092
director_id | genre | prob | movie_id |
---|---|---|---|
429 | Adventure | 0.750000 | 300229 |
429 | Fantasy | 0.750000 | 300229 |
11652 | Sci-Fi | 0.500000 | 10920 |
11652 | Sci-Fi | 0.500000 | 333856 |
11652 | Action | 0.500000 | 10920 |
11652 | Action | 0.500000 | 333856 |
14927 | Animation | 1.000000 | 192017 |
14927 | Family | 1.000000 | 192017 |
15092 | Comedy | 0.545455 | 109093 |
15092 | Comedy | 0.545455 | 237431 |
15092 | Crime | 0.545455 | 109093 |
15092 | Crime | 0.545455 | 237431 |
All IDs across both tables: 429, 2931, 9247, 11652, 14927, 15092. Only 429, 11652, 14927, and 15092 appear in both, so only their rows are kept.
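A minimal sketch of this join, assuming the tables above are stored as directors_genres and movies_directors:

library(dplyr)

# keep only the rows whose director_id appears in BOTH tables
directors_genres |>
  inner_join(movies_directors, by = join_by(director_id))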
inner_join(): IMDb Example
What if our key does not have the same name?
director_id | genre | prob |
---|---|---|
429 | Adventure | 0.750000 |
429 | Fantasy | 0.750000 |
2931 | Drama | 0.714286 |
2931 | Action | 0.428571 |
11652 | Sci-Fi | 0.500000 |
11652 | Action | 0.500000 |
14927 | Animation | 1.000000 |
14927 | Family | 1.000000 |
15092 | Comedy | 0.545455 |
15092 | Crime | 0.545455 |
id | first_name | last_name | genre | prob |
---|---|---|---|---|
429 | Andrew | Adamson | Adventure | 0.750000 |
429 | Andrew | Adamson | Fantasy | 0.750000 |
11652 | James (I) | Cameron | Sci-Fi | 0.500000 |
11652 | James (I) | Cameron | Action | 0.500000 |
14927 | Ron | Clements | Animation | 1.000000 |
14927 | Ron | Clements | Family | 1.000000 |
15092 | Ethan | Coen | Comedy | 0.545455 |
15092 | Ethan | Coen | Crime | 0.545455 |
Join by different variables on dataX and dataY: join_by(a == b) will match dataX$a to dataY$b.
Remember: the dataset you pipe in becomes the first argument of the function you are piping into!
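For example, a sketch of the join that could produce the name-plus-genre table above, assuming the director names live in a directors table whose key column is id:

library(dplyr)

# id (from directors) is matched against director_id (from directors_genres)
directors |>
  inner_join(directors_genres, by = join_by(id == director_id))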
We can change which observations are kept by changing the join function.
left_join() – keep only (and all) observations present in the left data set
right_join() – keep only (and all) observations present in the right data set
full_join() – keep all observations present in either data set
Diagrams: inner_join(), left_join(), right_join(), full_join()
Discussion
How many movies are there in the data for each director (by name), including directors who don't have any movies in the data? Which join should I use??
directors |>
??_join(movies_directors,
by = join_by(id == director_id))
Discussion
What is the complete set of movies and actors included in the data? Which join should I use??
roles |>
??_join(actors,
by = join_by(actor_id == id))
actor_id | movie_id | role |
---|---|---|
933 | 333856 | Lewis Bodine |
2547 | 300229 | Duloc Mascot |
2700 | 306032 | Tyrone |
2898 | 333856 | Slovakian three-year-old boy |
2925 | 192017 | Additional Voices |
semi_join()
Keeps observations when their keys are present in both datasets, but only keeps variables from the left dataset.
semi_join()
director_id | genre | prob |
---|---|---|
429 | Adventure | 0.750000 |
429 | Fantasy | 0.750000 |
11652 | Sci-Fi | 0.500000 |
11652 | Action | 0.500000 |
14927 | Animation | 1.000000 |
14927 | Family | 1.000000 |
15092 | Comedy | 0.545455 |
15092 | Crime | 0.545455 |
directors_genres IDs: 429, 2931, 11652, 14927, 15092; only those also present in movies_directors are kept.
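A sketch of this filtering join, with the same assumed table names as before:

library(dplyr)

# keep directors_genres rows whose director_id appears in movies_directors,
# without adding any columns from movies_directors
directors_genres |>
  semi_join(movies_directors, by = join_by(director_id))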
anti_join()
Removes observations when their keys are present in both datasets, and only keeps variables from the left dataset.
anti_join()
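And the complement, again a sketch under the same naming assumptions:

library(dplyr)

# drop directors_genres rows whose director_id appears in movies_directors
directors_genres |>
  anti_join(movies_directors, by = join_by(director_id))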
Now we have tools to:
combine datasets (xx_join())
filter observations (filter() and xx_join())
create new variables (mutate())
select variables (select())
We are well on our way to building and cleaning up a nice dataset! 🥳
What’s next?
We may need to transform our data to turn it into the version of tidy that is best for a task at hand.
Artwork by Allison Horst
Let’s say we want to look at mean cereal nutrients based on shelf.
name | manuf | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100% Bran | N | cold | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1 | 0.33 | 68.40297 |
100% Natural Bran | Q | cold | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1 | 1.00 | 33.98368 |
All-Bran | K | cold | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1 | 0.33 | 59.42551 |
All-Bran with Extra Fiber | K | cold | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1 | 0.50 | 93.70491 |
Almond Delight | R | cold | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1 | 0.75 | 34.38484 |
Apple Cinnamon Cheerios | G | cold | 110 | 2 | 2 | 180 | 1.5 | 10.5 | 10 | 70 | 25 | 1 | 1 | 0.75 | 29.50954 |
Discussion
How would we plot the mean cereal nutrients by shelf (as shown below) with the wide data using ggplot2?
Transforming the data will make plotting much easier
shelf | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins |
---|---|---|---|---|---|---|---|---|---|
1 | 102.5000 | 2.650000 | 0.60 | 176.2500 | 1.6850000 | 15.80000 | 4.800000 | 75.50000 | 20.00000 |
2 | 109.5238 | 1.904762 | 1.00 | 145.7143 | 0.9047619 | 13.61905 | 9.619048 | 57.80952 | 23.80952 |
3 | 107.7778 | 2.861111 | 1.25 | 158.6111 | 3.1388889 | 14.50000 | 6.527778 | 129.83333 | 35.41667 |
# named colors so both nutrients appear in one legend
my_colors <- c("calories_col" = "steelblue", "sugars_col" = "orange3")

cereal_wide |>
  ggplot() +
  # with wide data, each nutrient needs its own geom_point()/geom_line() pair
  geom_point(aes(x = shelf, y = calories, color = "calories_col")) +
  geom_line(aes(x = shelf, y = calories, color = "calories_col")) +
  geom_point(aes(x = shelf, y = sugars, color = "sugars_col")) +
  geom_line(aes(x = shelf, y = sugars, color = "sugars_col")) +
  scale_color_manual(values = my_colors, labels = names(my_colors)) +
  labs(x = "Shelf", y = "", subtitle = "Mean Amount", color = "Nutrient")
shelf | Nutrient | mean_amount |
---|---|---|
1 | calories | 102.5000000 |
1 | carbo | 15.8000000 |
1 | fat | 0.6000000 |
1 | fiber | 1.6850000 |
1 | potass | 75.5000000 |
1 | protein | 2.6500000 |
1 | sodium | 176.2500000 |
1 | sugars | 4.8000000 |
1 | vitamins | 20.0000000 |
2 | calories | 109.5238095 |
2 | carbo | 13.6190476 |
2 | fat | 1.0000000 |
2 | fiber | 0.9047619 |
2 | potass | 57.8095238 |
2 | protein | 1.9047619 |
2 | sodium | 145.7142857 |
2 | sugars | 9.6190476 |
2 | vitamins | 23.8095238 |
3 | calories | 107.7777778 |
3 | carbo | 14.5000000 |
3 | fat | 1.2500000 |
3 | fiber | 3.1388889 |
3 | potass | 129.8333333 |
3 | protein | 2.8611111 |
3 | sodium | 158.6111111 |
3 | sugars | 6.5277778 |
3 | vitamins | 35.4166667 |
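With the long data, the same plot needs only one geom_point() and one geom_line(). A sketch, assuming the long table above is stored as cereal_long (a hypothetical name):

library(dplyr)
library(ggplot2)

cereal_long |>
  # match the wide-data plot, which showed only calories and sugars
  filter(Nutrient %in% c("calories", "sugars")) |>
  ggplot(aes(x = shelf, y = mean_amount, color = Nutrient)) +
  geom_point() +
  geom_line() +
  labs(x = "Shelf", y = "", subtitle = "Mean Amount", color = "Nutrient")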
Animation by Kelsey Gonzalez
Consider daily rainfall observed in SLO in January 2023.
Discussion
How would you manually convert this to long format?
Create a new column called Week.
Create a new column called Day_of_Week.
Create a new column called Rainfall (to hold the daily rainfall values).
For each day of the week, repeat Weeks 1-5 and copy that day's values over (e.g., the Monday values).
We can use pivot_longer() to turn a wide dataset into a long(er) dataset.
pivot_longer()
Take a wide dataset and turn it into a long dataset.
cols – specify the columns that should be pivoted.
names_to – the name of the new column containing the old column names.
values_to – the name of the new column containing the old column values.
pivot_longer()
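A sketch producing the table below, assuming the wide rainfall data is stored as rainfall_wide (a hypothetical name) with one row per Week and one column per day:

library(tidyr)

# pivot the seven day-of-week columns into two columns:
# one holding the day names, one holding the rainfall values
rainfall_wide |>
  pivot_longer(cols = Sunday:Saturday,
               names_to = "Day_of_Week",
               values_to = "Daily_Rainfall")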
Week | Day_of_Week | Daily_Rainfall |
---|---|---|
1 | Sunday | 0.00 |
1 | Monday | 0.12 |
1 | Tuesday | 0.00 |
1 | Wednesday | 1.58 |
1 | Thursday | 0.91 |
1 | Friday | 0.00 |
1 | Saturday | 0.05 |
2 | Sunday | 0.27 |
2 | Monday | 4.26 |
2 | Tuesday | 0.43 |
2 | Wednesday | 0.00 |
2 | Thursday | 0.00 |
2 | Friday | 0.16 |
2 | Saturday | 1.41 |
3 | Sunday | 0.34 |
3 | Monday | 0.33 |
3 | Tuesday | 0.00 |
3 | Wednesday | 0.00 |
3 | Thursday | 0.13 |
3 | Friday | 0.00 |
3 | Saturday | 0.00 |
4 | Sunday | 0.00 |
4 | Monday | 0.00 |
4 | Tuesday | 0.00 |
4 | Wednesday | 0.00 |
4 | Thursday | 0.00 |
4 | Friday | 0.00 |
4 | Saturday | NA |
5 | Sunday | NA |
5 | Monday | NA |
5 | Tuesday | NA |
5 | Wednesday | NA |
5 | Thursday | NA |
5 | Friday | NA |
5 | Saturday | NA |
What is the mean amount of protein for cereals on each shelf and for each manuf?
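A sketch, assuming the cereal data is stored as cereals (a hypothetical name):

library(dplyr)

# one mean per manufacturer-shelf combination
cereals |>
  group_by(manuf, shelf) |>
  summarize(mean_protein = mean(protein), .groups = "drop")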
manuf | shelf | mean_protein |
---|---|---|
A | 2 | 4.000000 |
G | 1 | 3.000000 |
G | 2 | 1.285714 |
G | 3 | 2.666667 |
K | 1 | 2.750000 |
K | 2 | 2.142857 |
K | 3 | 2.916667 |
N | 1 | 2.666667 |
N | 2 | 2.500000 |
N | 3 | 4.000000 |
P | 1 | 1.500000 |
P | 2 | 1.000000 |
P | 3 | 3.000000 |
Q | 1 | 5.000000 |
Q | 2 | 2.000000 |
Q | 3 | 2.500000 |
R | 1 | 2.000000 |
R | 3 | 3.000000 |
Discussion
What could we do to make this table easier to understand?
pivot_wider()
Take a long dataset and turn it into a wide dataset.
id_cols – specify the column(s) that contain the ID for unique rows in the wide dataset.
names_from – the name of the column containing the new column names.
values_from – the name of the column containing the new column values.
pivot_wider()
Much easier to read!
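A sketch producing the table below, assuming the long summary above is stored as protein_by_shelf (a hypothetical name):

library(tidyr)

# one row per manuf, one column per shelf
protein_by_shelf |>
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein)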
manuf | 1 | 2 | 3 |
---|---|---|---|
G | 3.000000 | 1.285714 | 2.666667 |
K | 2.750000 | 2.142857 | 2.916667 |
N | 2.666667 | 2.500000 | 4.000000 |
P | 1.500000 | 1.000000 | 3.000000 |
Q | 5.000000 | 2.000000 | 2.500000 |
R | 2.000000 | NA | 3.000000 |
A | NA | 4.000000 | NA |
pivot_wider()
Even better!
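The same sketch with nicer column names: pivot_wider()'s names_prefix argument prepends text to each new column name.

library(tidyr)

protein_by_shelf |>
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein,
              names_prefix = "Shelf_")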
manuf | Shelf_1 | Shelf_2 | Shelf_3 |
---|---|---|---|
G | 3.000000 | 1.285714 | 2.666667 |
K | 2.750000 | 2.142857 | 2.916667 |
N | 2.666667 | 2.500000 | 4.000000 |
P | 1.500000 | 1.000000 | 3.000000 |
Q | 5.000000 | 2.000000 | 2.500000 |
R | 2.000000 | NA | 3.000000 |
A | NA | 4.000000 | NA |
Today you will be tidying messy data to explore the relationship between countries of the world and military spending.
Use the dplyr and tidyr cheatsheets.
Don't struggle individually for too long!
Comments from Week 3