Data Joins + Pivots

Tuesday, April 22

Today we will…

  • Big Picture
  • New Material
    • Joining data with dplyr
    • Pivoting data with tidyr
  • PA 4: Military Spending

Follow along

Remember to download, save, and open up the starter notes for this week! This week you need to download and save some data as well.

Comments from Week 3

  • Very nice work overall!!
  • Work on the lab early!
    • Get an idea of how long it may take and what any big challenges are
  • Be thoughtful about use of color in plots
  • Only save variables / intermediate objects when needed
  • Avoid long lines of code

Big Picture - Data Science Workflow

The data science process (diagram adapted from r4ds)

We have covered…

Today

Building an Analytical Dataset

Getting Data

  • So far, we have simply needed to import one nice rectangular data set in a typical file type
  • Real life often gets a bit more complicated!!

Motivating (Real) Example

  • Texas Education Data (AEIS): K-12 student performance data provided by the Texas Education Agency
  • Provides A LOT of information … in many separate files:
    • by year
    • for State, Regions, Districts, or Schools
    • for different sets of variables

🧐 need to get this #%$ together before we can analyze it

Relational Data

  • Multiple, interconnected tables of data are called relational.
  • Individual datasets may not provide exactly what we need - but we can use the relations between datasets to get the information we want.

IMDb movie relational data

Example - IMDb Movie Data

Discussion

What if we want to know which actor has worked with the most directors in the dataset?

What analytical dataset would we need to answer this question? What are the rows, and variables needed?


  • 💡 In order to answer our question, we need to combine some of the individual datasets into one big dataset
  • Joins!

Data Joins

Data Joins

We can combine (join) data tables based on their relations.

Mutating joins

Add variables from a new dataframe to observations in an existing dataframe.

full_join(), left_join(), right_join(), inner_join()

Filtering Joins

Filter observations based on whether their keys appear in a new dataframe.

semi_join(), anti_join()

Keys

A key is a variable (or combination of variables) that should uniquely identify an observation in a data set.

  • To combine (join) two datasets, a key needs to be present in both.
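
Before joining, it is worth checking that a candidate key really is unique. A minimal sketch (assuming the IMDb directors table used in the examples below):

library(dplyr)

# a valid key has no duplicated values --
# zero rows returned means `id` uniquely identifies each director
directors |> 
  count(id) |> 
  filter(n > 1)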

General Structure of a Join

  • Choose a left and a right dataset
  • Add or remove rows based on the type of join and the structure of the left vs. right data
  • Add columns (or not) based on the type of join and the structure of the left vs. right data
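
A toy sketch of this structure (hypothetical tables, not from the IMDb data):

library(dplyr)

left  <- tibble(key = c("a", "b", "c"), x = 1:3)
right <- tibble(key = c("b", "c", "d"), y = 4:6)

left |> inner_join(right, by = "key")  # keeps rows b, c (keys present in both)
left |> left_join(right, by = "key")   # keeps rows a, b, c (all left rows)
left |> full_join(right, by = "key")   # keeps rows a, b, c, d (all rows)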

inner_join()

Keeps only the observations whose keys are present in both datasets.

Discussion

When would you want to use inner_join()?

inner_join(): IMDb Example

directors_genres
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
2931 Drama 0.714286
2931 Action 0.428571
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455
movies_directors
director_id movie_id
429 300229
9247 124110
11652 10920
11652 333856
14927 192017
15092 109093
15092 237431

directors_genres IDs: 429, 2931, 11652, 14927, 15092
movies_directors IDs: 429, 9247, 11652, 14927, 15092

inner_join(directors_genres, movies_directors, 
           by = "director_id")
director_id genre prob movie_id
429 Adventure 0.750000 300229
429 Fantasy 0.750000 300229
11652 Sci-Fi 0.500000 10920
11652 Sci-Fi 0.500000 333856
11652 Action 0.500000 10920
11652 Action 0.500000 333856
14927 Animation 1.000000 192017
14927 Family 1.000000 192017
15092 Comedy 0.545455 109093
15092 Comedy 0.545455 237431
15092 Crime 0.545455 109093
15092 Crime 0.545455 237431

Result IDs: 429, 11652, 14927, 15092 (2931 and 9247 are dropped because each appears in only one table)

inner_join(): IMDb Example

What if our key does not have the same name?

directors_genres
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
2931 Drama 0.714286
2931 Action 0.428571
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455
directors
id first_name last_name
429 Andrew Adamson
9247 Zach Braff
11652 James (I) Cameron
14927 Ron Clements
15092 Ethan Coen
inner_join(directors_genres, 
           directors, 
           by = join_by(director_id == id))
id first_name last_name genre prob
429 Andrew Adamson Adventure 0.750000
429 Andrew Adamson Fantasy 0.750000
11652 James (I) Cameron Sci-Fi 0.500000
11652 James (I) Cameron Action 0.500000
14927 Ron Clements Animation 1.000000
14927 Ron Clements Family 1.000000
15092 Ethan Coen Comedy 0.545455
15092 Ethan Coen Crime 0.545455

Join by different variables on dataX and dataY: join_by(a == b) will match dataX$a to dataY$b.

Piping Joins

Remember: the dataset you pipe in becomes the first argument of the function you are piping into!

  • If you are using a pipe,
    • the piped in data is the left dataset
    • specify the right dataset inside the join function.
inner_join(directors_genres, movies_directors, 
           by = "director_id")

…is equivalent to…

directors_genres |> 
  inner_join(movies_directors, by = "director_id")

More Mutating Joins

  • left_join() – keep only (and all) observations present in the left data set

  • right_join() – keep only (and all) observations present in the right data set

  • full_join() – keep all observations present in either data set
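
A sketch contrasting these joins on the IMDb tables from the inner_join() example (recall that directors_genres has director 2931 that movies_directors lacks, and movies_directors has 9247 that directors_genres lacks):

left_join(directors_genres, movies_directors, by = "director_id")
# keeps 2931 (left only); its movie_id is filled with NA

right_join(directors_genres, movies_directors, by = "director_id")
# keeps 9247 (right only); its genre and prob are filled with NA

full_join(directors_genres, movies_directors, by = "director_id")
# keeps both 2931 and 9247, with NAs in the unmatched columns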

Why Use a Certain Mutating Join?

  • inner_join()
    • You want all of the columns from both the left and right data, and only the observations that have information in both
  • left_join()
    • The left dataset is your “main” data and you just want to add information (columns) from the right dataset
  • right_join()
    • The right data is your “main data” and you just want to add columns from the left dataset
  • full_join()
    • You want all of the columns from both left and right data for all of the observations possible

Which Join?

Discussion

How many movies are there in the data for each director (by name), including directors who don’t have any movies in the data? Which join should I use?? (One approach is sketched after the data previews below.)

directors |> 
  ??_join(movies_directors, 
          by = join_by(id == director_id))


directors |> 
  slice_head(n = 5)
id first_name last_name
429 Andrew Adamson
2931 Darren Aronofsky
9247 Zach Braff
11652 James (I) Cameron
14927 Ron Clements
movies_directors |> 
  slice_head(n = 5)
director_id movie_id
429 300229
2931 254943
9247 124110
11652 10920
11652 333856
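
One possible approach (a sketch): left_join() keeps every director, and counting non-missing movie_id values gives 0 for directors with no movies.

directors |> 
  left_join(movies_directors, by = join_by(id == director_id)) |> 
  group_by(first_name, last_name) |> 
  summarize(n_movies = sum(!is.na(movie_id)))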

Which Join?

Discussion

What is the complete set of movies and actors included in the data? Which join should I use?? (One approach is sketched after the data previews below.)

roles |> 
  ??_join(actors, 
          by = join_by(actor_id == id))


roles |> 
  slice_head(n = 5)
actor_id movie_id role
933 333856 Lewis Bodine
2547 300229 Duloc Mascot
2700 306032 Tyrone
2898 333856 Slovakian three-year-old boy
2925 192017 Additional Voices
actors |> 
  slice_head(n = 5)
id first_name last_name gender film_count
933 Lewis Abernathy M 1
2547 Andrew Adamson M 1
2700 William Addy M 1
2898 Seth (I) Adkins M 1
2925 Charles (I) Adler M 1
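
One possible answer (a sketch): full_join() keeps every row from both tables, so no movie or actor is dropped.

roles |> 
  full_join(actors, by = join_by(actor_id == id))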

Filtering Joins: semi_join()

Keeps observations whose keys are present in both datasets, but keeps only the variables from the left dataset.



IMDb Data Example

directors_genres |> 
  distinct(director_id)
director_id
429
2931
11652
14927
15092
movies_directors |> 
  distinct(director_id)
director_id
429
9247
11652
14927
15092

Filtering Joins: semi_join()

directors_genres |> 
  semi_join(movies_directors, by = "director_id")
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455

IDs present in movies_directors: 429, 9247, 11652, 14927, 15092 (so 2931 is dropped)

directors_genres |>
  filter(director_id %in% movies_directors$director_id)
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455

Filtering Joins: anti_join()

Keeps only the observations whose keys are not present in the right dataset, and keeps only the variables from the left dataset.




Filtering Joins: anti_join()

directors_genres |> 
  anti_join(movies_directors, by = "director_id")
director_id genre prob
2931 Drama 0.714286
2931 Action 0.428571

IDs present in movies_directors: 429, 9247, 11652, 14927, 15092 (so only 2931 is kept)

directors_genres |>
  filter(!director_id %in% movies_directors$director_id)
director_id genre prob
2931 Drama 0.714286
2931 Action 0.428571

Building an Analytical Dataset

Now we have tools to:

  • Combine multiple data sets (xx_join())
  • Subset to certain observations (filter() and xx_join())
  • Create new variables (mutate())
  • Select columns of interest (select())

We are well on our way to building and cleaning up a nice dataset! 🥳
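
Putting these tools together, a typical pipeline might look like this (a sketch, assuming the IMDb tables above):

analytic <- directors |> 
  # combine: attach each director's movies
  inner_join(movies_directors, by = join_by(id == director_id)) |> 
  # subset: keep only directors that also appear in directors_genres
  semi_join(directors_genres, by = join_by(id == director_id)) |> 
  # create: a single name variable
  mutate(full_name = paste(first_name, last_name)) |> 
  # select: just the columns we need
  select(full_name, movie_id)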

Transform and Tidy

What’s next?


We may need to transform our data to turn it into the version of tidy that is best for a task at hand.

Illustration by Allison Horst

Reshaping Data Layouts with Pivots

Creating Tidy Data

Let’s say we want to look at mean cereal nutrients based on shelf.

  • The data are in a wide format – a separate column for each nutrient.
library(liver)
data(cereal)
head(cereal)
name manuf type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
100% Bran N cold 70 4 1 130 10.0 5.0 6 280 25 3 1 0.33 68.40297
100% Natural Bran Q cold 120 3 5 15 2.0 8.0 8 135 0 3 1 1.00 33.98368
All-Bran K cold 70 4 1 260 9.0 7.0 5 320 25 3 1 0.33 59.42551
All-Bran with Extra Fiber K cold 50 4 0 140 14.0 8.0 0 330 25 3 1 0.50 93.70491
Almond Delight R cold 110 2 2 200 1.0 14.0 8 -1 25 3 1 0.75 34.38484
Apple Cinnamon Cheerios G cold 110 2 2 180 1.5 10.5 10 70 25 1 1 0.75 29.50954

Creating Tidy Data

Discussion

How would we plot the mean cereal nutrients by shelf (as shown below) with the wide data using ggplot2?

Transforming the data will make plotting much easier

Creating Tidy Data

cereal_wide <- cereal |> 
  group_by(shelf) |> 
  summarise(across(calories:vitamins, mean))
shelf calories protein fat sodium fiber carbo sugars potass vitamins
1 102.5000 2.650000 0.60 176.2500 1.6850000 15.80000 4.800000 75.50000 20.00000
2 109.5238 1.904762 1.00 145.7143 0.9047619 13.61905 9.619048 57.80952 23.80952
3 107.7778 2.861111 1.25 158.6111 3.1388889 14.50000 6.527778 129.83333 35.41667
my_colors <- c("calories_col" = "steelblue", "sugars_col" = "orange3")

cereal_wide |> 
  ggplot() +
  geom_point(aes(x = shelf, y = calories, color = "calories_col")) +
  geom_line(aes(x = shelf, y = calories, color = "calories_col")) + 
  geom_point(aes(x = shelf, y = sugars, color = "sugars_col")) +
  geom_line(aes(x = shelf, y = sugars, color = "sugars_col")) +
  scale_color_manual(values = my_colors, labels = names(my_colors)) +
  labs(x = "Shelf", y = "", subtitle = "Mean Amount", color = "Nutrient")

cereal_long <- cereal |> 
  pivot_longer(cols = calories:vitamins,
               names_to = "Nutrient",
               values_to = "Amount") |> 
  group_by(shelf, Nutrient) |> 
  summarise(mean_amount = mean(Amount))
shelf Nutrient mean_amount
1 calories 102.5000000
1 carbo 15.8000000
1 fat 0.6000000
1 fiber 1.6850000
1 potass 75.5000000
1 protein 2.6500000
1 sodium 176.2500000
1 sugars 4.8000000
1 vitamins 20.0000000
2 calories 109.5238095
2 carbo 13.6190476
2 fat 1.0000000
2 fiber 0.9047619
2 potass 57.8095238
2 protein 1.9047619
2 sodium 145.7142857
2 sugars 9.6190476
2 vitamins 23.8095238
3 calories 107.7777778
3 carbo 14.5000000
3 fat 1.2500000
3 fiber 3.1388889
3 potass 129.8333333
3 protein 2.8611111
3 sodium 158.6111111
3 sugars 6.5277778
3 vitamins 35.4166667
cereal_long |> 
  ggplot(aes(x = shelf, 
             y = mean_amount, 
             color = Nutrient)) +
  geom_point() +
  geom_line() +
  labs(x = "Shelf", y = "", subtitle = "Mean Amount")

Data Layouts

Kelsey Gonzalez

Manual Method

Consider daily rainfall observed in SLO in January 2023.

  • The data is in a human-friendly form (like a calendar).
  • Each week has a row, and each day has a column.

Data source

Discussion

How would you manually convert this to long format?

Manual Method: Steps

  1. Keep the column Week.
  2. Create a new column Day_of_Week.
  3. Create a new column Rainfall (to hold the daily rainfall values).
  4. Now we have three columns – move the Sunday values over.
  5. Duplicate Weeks 1-5 and copy the Monday values over.
  6. Repeat for the remaining days…

Computational Approach

We can use pivot_longer() to turn a wide dataset into a long(er) dataset.

pivot_longer()

Take a wide dataset and turn it into a long dataset.

  • cols – specify the columns that should be pivoted.
    • Do not include the names of ID columns (columns that should not be pivoted).
  • names_to – the name of the new column containing the old column names.
  • values_to – the name of the new column containing the old column values.

pivot_longer()

slo_rainfall |> 
  pivot_longer(cols      = Sunday:Saturday,
               names_to  = "Day_of_Week",
               values_to = "Daily_Rainfall")
Week Day_of_Week Daily_Rainfall
1 Sunday 0.00
1 Monday 0.12
1 Tuesday 0.00
1 Wednesday 1.58
1 Thursday 0.91
1 Friday 0.00
1 Saturday 0.05
2 Sunday 0.27
2 Monday 4.26
2 Tuesday 0.43
2 Wednesday 0.00
2 Thursday 0.00
2 Friday 0.16
2 Saturday 1.41
3 Sunday 0.34
3 Monday 0.33
3 Tuesday 0.00
3 Wednesday 0.00
3 Thursday 0.13
3 Friday 0.00
3 Saturday 0.00
4 Sunday 0.00
4 Monday 0.00
4 Tuesday 0.00
4 Wednesday 0.00
4 Thursday 0.00
4 Friday 0.00
4 Saturday NA
5 Sunday NA
5 Monday NA
5 Tuesday NA
5 Wednesday NA
5 Thursday NA
5 Friday NA
5 Saturday NA

Long to Wide

What is the mean amount of protein for cereals on each shelf, for each manuf?

mean_protein <- cereal |> 
  group_by(manuf, shelf) |> 
  summarize(mean_protein = mean(protein))
manuf shelf mean_protein
A 2 4.000000
G 1 3.000000
G 2 1.285714
G 3 2.666667
K 1 2.750000
K 2 2.142857
K 3 2.916667
N 1 2.666667
N 2 2.500000
N 3 4.000000
P 1 1.500000
P 2 1.000000
P 3 3.000000
Q 1 5.000000
Q 2 2.000000
Q 3 2.500000
R 1 2.000000
R 3 3.000000

Discussion

What could we do to make this table easier to understand?

pivot_wider()

Take a long dataset and turn it into a wide dataset.

  • id_cols – specify the column(s) that contain the ID for unique rows in the wide dataset.
  • names_from – the name of the column containing the new column names.
  • values_from – the name of the column containing the new column values.

pivot_wider()

Much easier to read!

mean_protein |> 
  arrange(shelf) |> 
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein)
manuf 1 2 3
G 3.000000 1.285714 2.666667
K 2.750000 2.142857 2.916667
N 2.666667 2.500000 4.000000
P 1.500000 1.000000 3.000000
Q 5.000000 2.000000 2.500000
R 2.000000 NA 3.000000
A NA 4.000000 NA

Better names in pivot_wider()

Even better!

mean_protein |> 
  arrange(shelf) |> 
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein,
              names_prefix = "Shelf_")
manuf Shelf_1 Shelf_2 Shelf_3
G 3.000000 1.285714 2.666667
K 2.750000 2.142857 2.916667
N 2.666667 2.500000 4.000000
P 1.500000 1.000000 3.000000
Q 5.000000 2.000000 2.500000
R 2.000000 NA 3.000000
A NA 4.000000 NA
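
If we later need the long version back, pivot_longer() can strip that prefix off the new column names (a sketch; protein_wide is a hypothetical name for the wide table above):

protein_wide |> 
  pivot_longer(cols = starts_with("Shelf_"),
               names_to = "shelf",
               names_prefix = "Shelf_",  # remove the prefix from the names
               values_to = "mean_protein")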

PA 4: Military Spending

Today you will be tidying messy data to explore the relationship between countries of the world and military spending.

Resources - use them all!

  • dplyr and tidyr cheatsheets
  • slides and textbook
  • help files
  • ⭐️️ each other⭐️
  • and me 😎

Don’t struggle individually for too long!

To do…

  • PA 4: Military Spending
    • Due Thursday, 4/24 before class