Data Joins + Pivots

Tuesday, April 22

Today we will…

  • Big Picture
  • New Material
    • Joining data with dplyr
    • Pivoting data with tidyr
  • PA 4: Military Spending

Follow along

Remember to download, save, and open up the starter notes for this week! This week you need to download and save some data as well.

Comments from Week 3

  • Very nice work overall!!
  • Work on the lab early!
    • Get an idea of how long it may take and what any big challenges are
  • Be thoughtful about use of color in plots
  • Only save variables / intermediate objects when needed
  • Avoid long lines of code

Big Picture - Data Science Workflow

The data science process (diagram adapted from r4ds)

We have covered…

Today

Building an Analytical Dataset

Getting Data

  • So far, we have simply needed to import one nice rectangular data set in a typical file type
  • Real life often gets a bit more complicated!!

Motivating (Real) Example

  • Texas Education Data (AEIS): K-12 student performance data provided by the Texas Education Agency
  • Provides A LOT of information … in many separate files:
    • by year
    • for State, Regions, Districts, or Schools
    • for different sets of variables

🧐 need to get this #%$ together before we can analyze it

Relational Data

  • Multiple, interconnected tables of data are called relational.
  • Individual datasets may not provide exactly what we need - but we can use the relations between datasets to get the information we want.

IMDb movie relational data

Example - IMDb Movie Data

Discussion

What if we want to know which actor has worked with the most directors in the dataset?

What analytical dataset would we need to answer this question? What are the rows, and variables needed?


  • 💡 In order to answer our question, we need to combine some of the individual datasets into one big dataset
  • Joins!

Data Joins

Data Joins

We can combine (join) data tables based on their relations.

Mutating joins

Add variables from a new dataframe to observations in an existing dataframe.

full_join(), left_join(), right_join(), inner_join()

Filtering Joins

Filter observations based on whether their keys appear in a new dataframe.

semi_join(), anti_join()

Keys

A key is a variable (or combination of variables) that should uniquely identify an observation in a data set.

  • To combine (join) two datasets, a key needs to be present in both.
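
Before joining, it is worth checking that a candidate key really is unique. A minimal sketch (assuming the IMDb directors table used in the examples below):

library(dplyr)

# a valid key has no duplicated values --
# zero rows returned means `id` uniquely identifies each director
directors |> 
  count(id) |> 
  filter(n > 1)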

General Structure of a Join

  • Choose a left and a right dataset
  • Add or remove rows based on the type of join and the structure of the left vs. right data
  • Add columns (or not) based on the type of join and the structure of the left vs. right data
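
A toy sketch of this structure (hypothetical tables, not from the IMDb data):

library(dplyr)

left  <- tibble(key = c("a", "b", "c"), x = 1:3)
right <- tibble(key = c("b", "c", "d"), y = 4:6)

left |> inner_join(right, by = "key")  # keeps rows b, c (keys present in both)
left |> left_join(right, by = "key")   # keeps rows a, b, c (all left rows)
left |> full_join(right, by = "key")   # keeps rows a, b, c, d (all rows)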

inner_join()

Keeps only the observations whose keys are present in both datasets.

Discussion

When would you want to use inner_join()?

inner_join(): IMDb Example

directors_genres
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
2931 Drama 0.714286
2931 Action 0.428571
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455
movies_directors
director_id movie_id
429 300229
9247 124110
11652 10920
11652 333856
14927 192017
15092 109093
15092 237431

directors_genres IDs: 429, 2931, 11652, 14927, 15092
movies_directors IDs: 429, 9247, 11652, 14927, 15092

inner_join(directors_genres, movies_directors, 
           by = "director_id")
director_id genre prob movie_id
429 Adventure 0.750000 300229
429 Fantasy 0.750000 300229
11652 Sci-Fi 0.500000 10920
11652 Sci-Fi 0.500000 333856
11652 Action 0.500000 10920
11652 Action 0.500000 333856
14927 Animation 1.000000 192017
14927 Family 1.000000 192017
15092 Comedy 0.545455 109093
15092 Comedy 0.545455 237431
15092 Crime 0.545455 109093
15092 Crime 0.545455 237431

Result IDs: 429, 11652, 14927, 15092 (2931 and 9247 are dropped because each appears in only one table)

inner_join(): IMDb Example

What if our key does not have the same name?

directors_genres
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
2931 Drama 0.714286
2931 Action 0.428571
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455
directors
id first_name last_name
429 Andrew Adamson
9247 Zach Braff
11652 James (I) Cameron
14927 Ron Clements
15092 Ethan Coen
inner_join(directors_genres, 
           directors, 
           by = join_by(director_id == id))
id first_name last_name genre prob
429 Andrew Adamson Adventure 0.750000
429 Andrew Adamson Fantasy 0.750000
11652 James (I) Cameron Sci-Fi 0.500000
11652 James (I) Cameron Action 0.500000
14927 Ron Clements Animation 1.000000
14927 Ron Clements Family 1.000000
15092 Ethan Coen Comedy 0.545455
15092 Ethan Coen Crime 0.545455

Join by different variables on dataX and dataY: join_by(a == b) will match dataX$a to dataY$b.

Piping Joins

Remember: the dataset you pipe in becomes the first argument of the function you are piping into!

  • If you are using a pipe,
    • the piped in data is the left dataset
    • specify the right dataset inside the join function.
inner_join(directors_genres, movies_directors, 
           by = "director_id")

…is equivalent to…

directors_genres |> 
  inner_join(movies_directors, by = "director_id")

More Mutating Joins

  • left_join() – keep only (and all) observations present in the left data set

  • right_join() – keep only (and all) observations present in the right data set

  • full_join() – keep all observations present in either data set
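
A sketch contrasting these joins on the IMDb tables from the inner_join() example (recall that directors_genres has director 2931 that movies_directors lacks, and movies_directors has 9247 that directors_genres lacks):

left_join(directors_genres, movies_directors, by = "director_id")
# keeps 2931 (left only); its movie_id is filled with NA

right_join(directors_genres, movies_directors, by = "director_id")
# keeps 9247 (right only); its genre and prob are filled with NA

full_join(directors_genres, movies_directors, by = "director_id")
# keeps both 2931 and 9247, with NAs in the unmatched columns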

Why Use a Certain Mutating Join?

  • inner_join()
    • You want all of the columns from both the left and right data, and only the observations that have information in both
  • left_join()
    • The left dataset is your “main” data and you just want to add information (columns) from the right dataset
  • right_join()
    • The right data is your “main data” and you just want to add columns from the left dataset
  • full_join()
    • You want all of the columns from both left and right data for all of the observations possible

Which Join?

Discussion

How many movies are there in the data for each director (by name), including directors who don’t have any movies in the data? Which join should I use?? (One approach is sketched after the data previews below.)

directors |> 
  ??_join(movies_directors, 
          by = join_by(id == director_id))


directors |> 
  slice_head(n = 5)
id first_name last_name
429 Andrew Adamson
2931 Darren Aronofsky
9247 Zach Braff
11652 James (I) Cameron
14927 Ron Clements
movies_directors |> 
  slice_head(n = 5)
director_id movie_id
429 300229
2931 254943
9247 124110
11652 10920
11652 333856
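
One possible approach (a sketch): left_join() keeps every director, and counting non-missing movie_id values gives 0 for directors with no movies.

directors |> 
  left_join(movies_directors, by = join_by(id == director_id)) |> 
  group_by(first_name, last_name) |> 
  summarize(n_movies = sum(!is.na(movie_id)))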

Which Join?

Discussion

What is the complete set of movies and actors included in the data? Which join should I use?? (One approach is sketched after the data previews below.)

roles |> 
  ??_join(actors, 
          by = join_by(actor_id == id))


roles |> 
  slice_head(n = 5)
actor_id movie_id role
933 333856 Lewis Bodine
2547 300229 Duloc Mascot
2700 306032 Tyrone
2898 333856 Slovakian three-year-old boy
2925 192017 Additional Voices
actors |> 
  slice_head(n = 5)
id first_name last_name gender film_count
933 Lewis Abernathy M 1
2547 Andrew Adamson M 1
2700 William Addy M 1
2898 Seth (I) Adkins M 1
2925 Charles (I) Adler M 1
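
One possible answer (a sketch): full_join() keeps every row from both tables, so no movie or actor is dropped.

roles |> 
  full_join(actors, by = join_by(actor_id == id))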

Filtering Joins: semi_join()

Keeps observations whose keys are present in both datasets, but keeps only the variables from the left dataset.



IMDb Data Example

directors_genres |> 
  distinct(director_id)
director_id
429
2931
11652
14927
15092
movies_directors |> 
  distinct(director_id)
director_id
429
9247
11652
14927
15092

Filtering Joins: semi_join()

directors_genres |> 
  semi_join(movies_directors, by = "director_id")
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455

IDs present in movies_directors: 429, 9247, 11652, 14927, 15092 (so 2931 is dropped)

directors_genres |>
  filter(director_id %in% movies_directors$director_id)
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455

Filtering Joins: anti_join()

Keeps only the observations whose keys are not present in the right dataset, and keeps only the variables from the left dataset.




Filtering Joins: anti_join()

directors_genres |> 
  anti_join(movies_directors, by = "director_id")
director_id genre prob
2931 Drama 0.714286
2931 Action 0.428571

IDs present in movies_directors: 429, 9247, 11652, 14927, 15092 (so only 2931 is kept)

directors_genres |>
  filter(!director_id %in% movies_directors$director_id)
director_id genre prob
2931 Drama 0.714286
2931 Action 0.428571

Building an Analytical Dataset

Now we have tools to:

  • Combine multiple data sets (xx_join())
  • Subset to certain observations (filter() and xx_join())
  • Create new variables (mutate())
  • Select columns of interest (select())

We are well on our way to building and cleaning up a nice dataset! 🥳
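
Putting these tools together, a typical pipeline might look like this (a sketch, assuming the IMDb tables above):

analytic <- directors |> 
  # combine: attach each director's movies
  inner_join(movies_directors, by = join_by(id == director_id)) |> 
  # subset: keep only directors that also appear in directors_genres
  semi_join(directors_genres, by = join_by(id == director_id)) |> 
  # create: a single name variable
  mutate(full_name = paste(first_name, last_name)) |> 
  # select: just the columns we need
  select(full_name, movie_id)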

Transform and Tidy

What’s next?


We may need to transform our data to turn it into the version of tidy that is best for a task at hand.

Illustration by Allison Horst

Reshaping Data Layouts with Pivots

Creating Tidy Data

Let’s say we want to look at mean cereal nutrients based on shelf.

  • The data are in a wide format – a separate column for each nutrient.
library(liver)
data(cereal)
head(cereal)
name manuf type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
100% Bran N cold 70 4 1 130 10.0 5.0 6 280 25 3 1 0.33 68.40297
100% Natural Bran Q cold 120 3 5 15 2.0 8.0 8 135 0 3 1 1.00 33.98368
All-Bran K cold 70 4 1 260 9.0 7.0 5 320 25 3 1 0.33 59.42551
All-Bran with Extra Fiber K cold 50 4 0 140 14.0 8.0 0 330 25 3 1 0.50 93.70491
Almond Delight R cold 110 2 2 200 1.0 14.0 8 -1 25 3 1 0.75 34.38484
Apple Cinnamon Cheerios G cold 110 2 2 180 1.5 10.5 10 70 25 1 1 0.75 29.50954

Creating Tidy Data

Discussion

How would we plot the mean cereal nutrients by shelf (as shown below) with the wide data using ggplot2?

Transforming the data will make plotting much easier

Creating Tidy Data

cereal_wide <- cereal |> 
  group_by(shelf) |> 
  summarise(across(calories:vitamins, mean))
shelf calories protein fat sodium fiber carbo sugars potass vitamins
1 102.5000 2.650000 0.60 176.2500 1.6850000 15.80000 4.800000 75.50000 20.00000
2 109.5238 1.904762 1.00 145.7143 0.9047619 13.61905 9.619048 57.80952 23.80952
3 107.7778 2.861111 1.25 158.6111 3.1388889 14.50000 6.527778 129.83333 35.41667
my_colors <- c("calories_col" = "steelblue", "sugars_col" = "orange3")

cereal_wide |> 
  ggplot() +
  geom_point(aes(x = shelf, y = calories, color = "calories_col")) +
  geom_line(aes(x = shelf, y = calories, color = "calories_col")) + 
  geom_point(aes(x = shelf, y = sugars, color = "sugars_col")) +
  geom_line(aes(x = shelf, y = sugars, color = "sugars_col")) +
  scale_color_manual(values = my_colors, labels = names(my_colors)) +
  labs(x = "Shelf", y = "", subtitle = "Mean Amount", color = "Nutrient")

cereal_long <- cereal |> 
  pivot_longer(cols = calories:vitamins,
               names_to = "Nutrient",
               values_to = "Amount") |> 
  group_by(shelf, Nutrient) |> 
  summarise(mean_amount = mean(Amount))
shelf Nutrient mean_amount
1 calories 102.5000000
1 carbo 15.8000000
1 fat 0.6000000
1 fiber 1.6850000
1 potass 75.5000000
1 protein 2.6500000
1 sodium 176.2500000
1 sugars 4.8000000
1 vitamins 20.0000000
2 calories 109.5238095
2 carbo 13.6190476
2 fat 1.0000000
2 fiber 0.9047619
2 potass 57.8095238
2 protein 1.9047619
2 sodium 145.7142857
2 sugars 9.6190476
2 vitamins 23.8095238
3 calories 107.7777778
3 carbo 14.5000000
3 fat 1.2500000
3 fiber 3.1388889
3 potass 129.8333333
3 protein 2.8611111
3 sodium 158.6111111
3 sugars 6.5277778
3 vitamins 35.4166667
cereal_long |> 
  ggplot(aes(x = shelf, 
             y = mean_amount, 
             color = Nutrient)) +
  geom_point() +
  geom_line() +
  labs(x = "Shelf", y = "", subtitle = "Mean Amount")

Data Layouts

Kelsey Gonzalez

Manual Method

Consider daily rainfall observed in SLO in January 2023.

  • The data is in a human-friendly form (like a calendar).
  • Each week has a row, and each day has a column.

Data source

Discussion

How would you manually convert this to long format?

Manual Method: Steps

  1. Keep the column Week.
  2. Create a new column Day_of_Week.
  3. Create a new column Rainfall (to hold the daily rainfall values).
  4. Now we have three columns – move the Sunday values over.
  5. Duplicate Weeks 1-5 and copy the Monday values over.
  6. Repeat for the remaining days…

Computational Approach

We can use pivot_longer() to turn a wide dataset into a long(er) dataset.

pivot_longer()

Take a wide dataset and turn it into a long dataset.

  • cols – specify the columns that should be pivoted.
    • Do not include the names of ID columns (columns that should not be pivoted).
  • names_to – the name of the new column containing the old column names.
  • values_to – the name of the new column containing the old column values.

pivot_longer()

slo_rainfall |> 
  pivot_longer(cols      = Sunday:Saturday,
               names_to  = "Day_of_Week",
               values_to = "Daily_Rainfall")
Week Day_of_Week Daily_Rainfall
1 Sunday 0.00
1 Monday 0.12
1 Tuesday 0.00
1 Wednesday 1.58
1 Thursday 0.91
1 Friday 0.00
1 Saturday 0.05
2 Sunday 0.27
2 Monday 4.26
2 Tuesday 0.43
2 Wednesday 0.00
2 Thursday 0.00
2 Friday 0.16
2 Saturday 1.41
3 Sunday 0.34
3 Monday 0.33
3 Tuesday 0.00
3 Wednesday 0.00
3 Thursday 0.13
3 Friday 0.00
3 Saturday 0.00
4 Sunday 0.00
4 Monday 0.00
4 Tuesday 0.00
4 Wednesday 0.00
4 Thursday 0.00
4 Friday 0.00
4 Saturday NA
5 Sunday NA
5 Monday NA
5 Tuesday NA
5 Wednesday NA
5 Thursday NA
5 Friday NA
5 Saturday NA

Long to Wide

What is the mean amount of protein for cereals on each shelf, for each manuf?

mean_protein <- cereal |> 
  group_by(manuf, shelf) |> 
  summarize(mean_protein = mean(protein))
manuf shelf mean_protein
A 2 4.000000
G 1 3.000000
G 2 1.285714
G 3 2.666667
K 1 2.750000
K 2 2.142857
K 3 2.916667
N 1 2.666667
N 2 2.500000
N 3 4.000000
P 1 1.500000
P 2 1.000000
P 3 3.000000
Q 1 5.000000
Q 2 2.000000
Q 3 2.500000
R 1 2.000000
R 3 3.000000

Discussion

What could we do to make this table easier to understand?

pivot_wider()

Take a long dataset and turn it into a wide dataset.

  • id_cols – specify the column(s) that contain the ID for unique rows in the wide dataset.
  • names_from – the name of the column containing the new column names.
  • values_from – the name of the column containing the new column values.

pivot_wider()

Much easier to read!

mean_protein |> 
  arrange(shelf) |> 
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein)
manuf 1 2 3
G 3.000000 1.285714 2.666667
K 2.750000 2.142857 2.916667
N 2.666667 2.500000 4.000000
P 1.500000 1.000000 3.000000
Q 5.000000 2.000000 2.500000
R 2.000000 NA 3.000000
A NA 4.000000 NA

Better names in pivot_wider()

Even better!

mean_protein |> 
  arrange(shelf) |> 
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein,
              names_prefix = "Shelf_")
manuf Shelf_1 Shelf_2 Shelf_3
G 3.000000 1.285714 2.666667
K 2.750000 2.142857 2.916667
N 2.666667 2.500000 4.000000
P 1.500000 1.000000 3.000000
Q 5.000000 2.000000 2.500000
R 2.000000 NA 3.000000
A NA 4.000000 NA
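
If we later need the long version back, pivot_longer() can strip that prefix off the new column names (a sketch; protein_wide is a hypothetical name for the wide table above):

protein_wide |> 
  pivot_longer(cols = starts_with("Shelf_"),
               names_to = "shelf",
               names_prefix = "Shelf_",  # remove the prefix from the names
               values_to = "mean_protein")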

PA 4: Military Spending

Today you will be tidying messy data to explore the relationship between countries of the world and military spending.

Resources - use them all!

  • dplyr and tidyr cheatsheets
  • slides and textbook
  • help files
  • ⭐️️ each other⭐️
  • and me 😎

Don’t struggle individually for too long!

To do…

  • PA 4: Military Spending
    • Due Thursday, 4/24 before class