Week 4, Part 1: Data Joins and Pivots – Introduction to Statistical Computing in R

1 Learning Objectives

Determine what data format is necessary to generate a desired plot or statistical model.
Understand the differences between “wide” and “long” format data and how to transition between the two structures.
Understand relational data formats and how to use data joins to assemble data from multiple tables into a single table.

📖 Readings: 75 min

📽 Optional Videos: 25 min

2 Relational Data and Joins

📖 Required Reading: R4DS 19.1-19.4: Joins

Note: don’t read 19.5! We will not cover non-equi joins in this course

3 Data Transformation

Data Pivots

📖 Required Reading: R4DS 5: Data tidying

📽 Optional Videos

Splitting Cells

Expanding on what we talked about in class, there’s a task that is fairly commonly encountered with functions that belong to the tidyr package: separating variables into two different columns separate_xxx() and it’s complement, unite(), which is useful for combining two variables into one column.

The datasets used in the examples below are included in the tidyr package!

Table 3 from before is a great example of when you would want to separate the values of a column into different columns. In this dataset, the variable rate is recorded as 745/19987071, where the first number is the amount of cases and the second number is the population.

A diagram labeled 'table3' shows two tables side by side with an arrow pointing from left to right. On the left, a table has columns country, year, and rate. The rate column contains combined values such as 0.7K/19M, 2K/20M, 37K/172M, 80K/174M, 212K/1T, and 213K/1T, where the first part represents cases and the second part represents population. On the right, the data is transformed by separating rate into two distinct columns labeled cases and pop. Each row now shows the same country and year with the cases value (for example 0.7K, 2K, 37K) in one column and the corresponding population value (19M, 20M, 172M, etc.) in the other. The visual demonstrates how splitting a combined column into multiple columns clarifies the data. — A visual representation of what separating variables means for data set operations.

table3 |> 
  kable()

country	year	rate
Afghanistan	1999	745/19987071
Afghanistan	2000	2666/20595360
Brazil	1999	37737/172006362
Brazil	2000	80488/174504898
China	1999	212258/1272915272
China	2000	213766/1280428583

table3 |>
  separate_wider_delim(cols        = rate,
                       names       = c("cases", "population"),
                       delim       = "/",
                       cols_remove = FALSE
           ) |> 
  kable()

country	year	cases	population	rate
Afghanistan	1999	745	19987071	745/19987071
Afghanistan	2000	2666	20595360	2666/20595360
Brazil	1999	37737	172006362	37737/172006362
Brazil	2000	80488	174504898	80488/174504898
China	1999	212258	1272915272	212258/1272915272
China	2000	213766	1280428583	213766/1280428583

I’ve left the rate column in the original data frame (cols_remove = FALSE) just to make it easy to compare and verify that yes, it worked.

And, of course, there is a complementary operation, which is when it’s necessary to join two columns to get a useable data value.

A diagram labeled 'table5' shows two tables side by side with an arrow pointing from left to right. On the left, a table has three columns: country, century, and year. Each country appears twice, once with century 19 and year 99, and once with century 20 and year 00. On the right, the data is transformed by uniting the century and year columns into a single year column that contains full values such as 1999 and 2000. Each row now shows the country alongside the combined year. The visual demonstrates how uniting multiple columns can merge partial values into a single meaningful column. — A visual representation of what uniting variables means for data set operations.

table5 |> 
  kable()

country	century	year	rate
Afghanistan	19	99	745/19987071
Afghanistan	20	00	2666/20595360
Brazil	19	99	37737/172006362
Brazil	20	00	80488/174504898
China	19	99	212258/1272915272
China	20	00	213766/1280428583

table5 |>
  unite(col = "year",
        c(century, year),
        sep = ''
        ) |> 
  kable()

country	year	rate
Afghanistan	1999	745/19987071
Afghanistan	2000	2666/20595360
Brazil	1999	37737/172006362
Brazil	2000	80488/174504898
China	1999	212258/1272915272
China	2000	213766/1280428583