Importing Data and Graphics with `ggplot2`

Tuesday, April 8

Today we will…

Style Note of the Day
New Material
- Welcome to the Tidyverse
- Load External Data
- Graphics (and ggplot2)
- Game Planning
PA 2: Using Data Visualization to Find the Penguins

Style Note of the Day - Function Calls

Tip

Name arguments in function calls

Only include necessary arguments! (If you are using any default values, no need to repeat them in your function call.)

Good

mean(1:10, na.rm = TRUE)
seq(from = 1, to = 100, by = 5)

Bad

mean(1:10, , TRUE)
mean(1:10, trim = 0, na.rm = TRUE)

seq(1, 100, 5)
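
A small sketch of why naming arguments helps (the `scores` vector is made up for illustration): named arguments can be supplied in any order without changing the result, while positional calls only work if the reader remembers the parameter order.

```r
# Named arguments: the call documents itself and does not depend on order.
by_name   <- seq(from = 1, to = 100, by = 5)
reordered <- seq(by = 5, to = 100, from = 1)

# Positional arguments: same result, but the reader must know the parameter order.
by_position <- seq(1, 100, 5)

identical(by_name, reordered)
identical(by_name, by_position)

# na.rm = TRUE is not the default, so it belongs in the call (by name).
scores <- c(4, NA, 10)  # hypothetical data with a missing value
mean(scores, na.rm = TRUE)
```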

Welcome to the Tidyverse

Tidywho?

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.¹

It provides most of the functionality you will need for an entire data analysis workflow, with a cohesive grammar.

Core Packages

The tidyverse includes functions to:

Read in data	`readr`
Visualize data	`ggplot2`
Manipulate rectangular data	`tidyr`, `dplyr`, `tibble`
Handle special variable types	`stringr`, `forcats`, `lubridate`
Support functional programming	`purrr`

Tidyverse and STAT 331

This version of the course will primarily use tidyverse packages and grammar
Reasoning:
- the tidyverse is as reputable and ubiquitous as base R at this point (in my opinion)
- the tidyverse is specifically designed to help programmers produce easy-to-read and reproducible analyses and to reduce errors
- there is excellent documentation!
- I like it!

Using the `tidyverse` package

Installing the tidyverse package installs all of the “tidyverse” packages; loading it attaches the core packages listed above.
Avoid redundantly installing or loading packages!

Do this (one or the other):

library(tidyverse)

library(readr)

Not this:

library(tidyverse)
library(readr)

Tidy Data

Artwork by Allison Horst

Working with External Data

Data Science Workflow

Common Types of Data Files

Look at the file extension for the type of data file.

.csv : “comma-separated values”

Name, Age
Bob, 49
Joe, 40

.xls, .xlsx: Microsoft Excel spreadsheet

Common approach: save as .csv
Nicer approach: use the readxl package

.txt: plain text

Could have any sort of delimiter…
Need to let R know what to look for!

Loading External Data

Using base R functions:

read.csv() is for reading in .csv files.
read.table() and read.delim() are for any data with “columns” (you specify the separator).

Loading External Data

The tidyverse has some cleaned-up versions in the readr and readxl packages:

read_csv() is for comma-separated data.
read_tsv() is for tab-separated data.
read_table() is for white-space-separated data.
read_delim() is for any data with “columns” (you specify the separator). The above are special cases.
read_excel() is specifically for dealing with Excel files.

Remember to load the readr and readxl packages first! (library(tidyverse) attaches readr, but readxl has to be loaded separately.)
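
Here is a minimal sketch of two of these functions in action. The file contents reuse the Name/Age example from earlier; the temporary files and the `|` separator are assumptions made up so the code runs on its own.

```r
library(readr)

# Write two tiny example files to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Age", "Bob,49", "Joe,40"), csv_path)

txt_path <- tempfile(fileext = ".txt")
writeLines(c("Name|Age", "Bob|49", "Joe|40"), txt_path)

# read_csv() assumes commas; read_delim() needs to be told the separator.
people_csv <- read_csv(csv_path, show_col_types = FALSE)
people_txt <- read_delim(txt_path, delim = "|", show_col_types = FALSE)

people_csv
people_txt
```

Both calls should give the same two-row table; only the delimiter differs.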

Take a look at the documentation

Reminder: Notebooks and File Paths

You have to tell R where to “find” the data you want to read in using a file path.
Quarto automatically sets the working directory to be the directory where the Quarto document is, for any code within the Quarto document
This overrides the directory set by an .Rproj

Pay attention to this when setting relative filepaths
- To “back out” of one directory, use "../"
- e.g.: "../data/dat.csv"
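
To see the "../" idea in action, here is a sketch that builds a throwaway folder layout and checks which relative path finds the file. The folder names `project` and `reports` are hypothetical, and `setwd()` is used only to imitate what Quarto does when it runs a document.

```r
# Build a throwaway layout: project/data/dat.csv and project/reports/
project <- file.path(tempdir(), "project")
dir.create(file.path(project, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "reports"), showWarnings = FALSE)
writeLines(c("Name,Age", "Bob,49", "Joe,40"),
           file.path(project, "data", "dat.csv"))

# Pretend the Quarto document lives in project/reports/,
# so that folder is the working directory when its code runs.
old_wd <- setwd(file.path(project, "reports"))

found_sibling <- file.exists("data/dat.csv")     # looks inside reports/
found_parent  <- file.exists("../data/dat.csv")  # backs out to project/ first

setwd(old_wd)  # restore the original working directory

found_sibling
found_parent
```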

Grammar of Graphics

Why Do We Create Graphics?

Grammar of Graphics

The Grammar of Graphics (GoG) is a principled way of specifying exactly how to create a particular graph from a given data set. It helps us to systematically design new graphs.

Think of a graph or a data visualization as a mapping…

FROM variables in the data set (or statistics computed from the data)…
TO visual attributes (or “aesthetics”) of marks (or “geometric elements”) on the page/screen.

Why Grammar of Graphics?

It’s more flexible than a “chart zoo” of named graphs.
The software understands the structure of your graph.
It easily automates graphing of data subsets.

Components of Grammar of Graphics

data: dataframe containing variables
aes : aesthetic mappings (position, color, symbol, …)
geom : geometric element (point, line, bar, box, …)
stat : statistical variable transformation (identity, count, linear model, quantile, …)
scale : scale transformation (log scale, color mapping, axes tick breaks, …)
coord : Cartesian, polar, map projection, …
facet : divide into subplots using a categorical variable
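
As a rough sketch, one plot can touch every component in this list. The particular choices below (log-scaled x-axis, linear trend, faceting by year) are only for illustration and are not a recommended design.

```r
library(ggplot2)

gog_plot <- ggplot(data = mpg,                      # data
                   mapping = aes(x = displ,         # aes: x position
                                 y = hwy,           # aes: y position
                                 color = drv)) +    # aes: color
  geom_point() +                                    # geom: points
  stat_smooth(method = "lm", formula = y ~ x,       # stat: linear model
              se = FALSE) +
  scale_x_log10() +                                 # scale: log-scaled x-axis
  coord_cartesian(ylim = c(10, 45)) +               # coord: Cartesian, zoomed in
  facet_wrap(vars(year))                            # facet: one panel per year

gog_plot
```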

Using `ggplot2`

How to Build a Graphic

Complete this template to build a basic graphic:

We use + to add layers to a graphic.

Add data
Add aesthetics
Add one geom per layer

This begins a plot that you can add layers to:

ggplot(data = mpg)

ggplot(data = mpg, 
       aes(x = class, y = hwy))

ggplot(data = mpg, 
       aes(x = class, y = hwy)) +
  geom_jitter()

ggplot(data = mpg, 
       aes(x = class, y = hwy)) +
  geom_jitter() +
  geom_boxplot()

How would you draw the points on top of the boxplots?
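
One possible answer, as a sketch: layers are drawn in the order they are added, so adding the boxplot first puts the points on top. The `outlier.shape = NA` argument is an extra suggestion (not part of the original example) that hides the boxplot's own outlier points so they are not drawn twice.

```r
library(ggplot2)

points_on_top <- ggplot(data = mpg,
                        aes(x = class, y = hwy)) +
  geom_boxplot(outlier.shape = NA) +  # added first, so it is drawn underneath
  geom_jitter()                       # added second, so points land on top

points_on_top
```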

Aesthetics

We map variables (columns) from the data to aesthetics on the graphic using the aes() function.

What aesthetics can we set (see ggplot2 cheat sheet for more)?

x, y
color, fill
linetype
lineend
size
shape

Global v. Local Aesthetics

Global Aesthetics

ggplot(data = mpg, 
       mapping = aes(x = class, 
                     y = hwy)) +
  geom_boxplot()

Local Aesthetics

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, 
                             y = hwy))
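
The difference matters once a plot has more than one layer. In this sketch (the choice of variables is just for illustration), x and y are global, so both layers use them, while color is local to the points, so the smoother draws a single trend line instead of one per drive type.

```r
library(ggplot2)

mixed_aes <- ggplot(data = mpg,
                    mapping = aes(x = displ,
                                  y = hwy)) +      # global: used by every layer
  geom_point(mapping = aes(color = drv)) +         # local: only the points are colored
  geom_smooth(method = "loess", formula = y ~ x)   # inherits x and y, but not color

mixed_aes
```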

Mapping v. Setting Aesthetics

Mapping Aesthetics

ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = class, 
                            y = hwy,
                            color = class))

Setting Aesthetics

ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = class, 
                            y = hwy),
              color = "steelblue")
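
A common slip is to put the constant inside aes(). The sketch below contrasts the two; the object names are made up for illustration.

```r
library(ggplot2)

# Setting: the constant goes outside aes(), so every point is steelblue.
set_color <- ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = class, y = hwy),
              color = "steelblue")

# Pitfall: inside aes(), "steelblue" is treated as data (a variable with one
# value), so ggplot2 picks a color from its default palette and adds a legend.
mapped_constant <- ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = class, y = hwy,
                            color = "steelblue"))

set_color
mapped_constant
```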

Geometric Objects

We use a geom_xxx() function to represent data points.

one variable

geom_density()
geom_dotplot()
geom_histogram()
geom_boxplot()

two variables

geom_point()
geom_line()
geom_density_2d()

three variables

geom_contour()
geom_raster()

Not an exhaustive list – see ggplot2 cheat sheet.

geom_point()
geom_text()
geom_line()

Code

ggplot(data = mpg,
       aes(x = cty,
           y = hwy,
           color = class)) +
  geom_point() +
  labs(x = "City (mpg)", y = "Highway (mpg)") +
  theme(axis.title = element_text(size = 14),
        legend.title = element_blank(),
        legend.text = element_text(size = 14))

Code

ggplot(data = mpg,
       aes(x = cty,
           y = hwy,
           color = class)) +
  geom_text(aes(label = class)) +
  labs(x = "City (mpg)", y = "Highway (mpg)") +
  theme(axis.title = element_text(size = 14),
        legend.title = element_blank(),
        legend.text = element_text(size = 14))

Code

ggplot(data = mpg,
       aes(x = cty,
           y = hwy,
           color = class)) +
  geom_line() +
  labs(x = "City (mpg)", y = "Highway (mpg)") +
  theme(axis.title = element_text(size = 14),
        legend.title = element_blank(),
        legend.text = element_text(size = 14))

Creating a Graphic

To create a specific type of graphic, we will combine aesthetics and geometric objects.

Let’s try it!

Game Planning

What: Game Plans! are strategic guides that prompt you to map your coding strategies before implementation.

How: Your favorite sketch app, paper + pencil, online whiteboard (Excalidraw!).

Why: Tool to connect data and desired graphic before you start coding

The Goal
Game Plan
ggplot

Start with the TX housing data.

Make a plot of median house price over time (including both individual data points and a smoothed trend line), distinguishing between different cities.

Code

ggplot(data = sm_tx,
       aes(x = date, y = median, color = city)) + 
  geom_point() + 
  geom_smooth(method = "loess") + 
  labs(x = "Date",
       y = "Median Home Price",
       title = "Texas Housing Prices")
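
The sm_tx data set used above is not defined in these notes. A plausible way to build something like it, assuming it is a small subset of the txhousing data that ships with ggplot2 (the three cities here are a guess, not the original selection):

```r
library(ggplot2)  # provides the txhousing data set
library(dplyr)

# Assumption: sm_tx keeps only a few cities so the plot stays readable.
sm_tx <- filter(txhousing,
                city %in% c("Austin", "Dallas", "Houston"))

glimpse(sm_tx)
```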

Faceting

Extracts subsets of data and places them in side-by-side plots.

facet_grid()
facet_wrap()

Code

ggplot(data = sm_tx,
       aes(x = date, y = median)) + 
  geom_point() + 
  facet_grid(cols = vars(city)) +
  geom_smooth(method = "loess") + 
  labs(x = "Date",
       y = "Median Home Price",
       title = "Texas Housing Prices")

Code

ggplot(data = sm_tx,
       aes(x = date, y = median)) + 
  geom_point() + 
  facet_wrap(vars(city)) +
  geom_smooth(method = "loess") + 
  labs(x = "Date",
       y = "Median Home Price",
       title = "Texas Housing Prices")

Faceting

Extracts subsets of data and places them in side-by-side plots.

Options
Scales
Labels

facet_grid(cols = vars(b)): facet into columns based on b
facet_grid(rows = vars(a)): facet into rows based on a
facet_grid(rows = vars(a), cols = vars(b)): facet into both rows and columns
facet_wrap(vars(b)): wrap facets into a rectangular layout

You can set scales to let axis limits vary across facets:

facet_grid(rows = vars(a),
           cols = vars(b),
           scales = ______)

"fixed" – default, x- and y-axis limits are the same for each facet
"free" – both x- and y-axis limits adjust to individual facets
"free_x" – only x-axis limits adjust
"free_y" – only y-axis limits adjust
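
For instance, here is a sketch with the mpg data (the variables are chosen only for illustration) where each row of facets gets its own y-axis limits while the x-axis stays shared:

```r
library(ggplot2)

free_y_plot <- ggplot(data = mpg,
                      mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(rows = vars(drv),
             scales = "free_y")  # y-axis limits adjust per row; x stays fixed

free_y_plot
```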

You can set a labeller to adjust facet labels.

Include both the variable name and the factor level in the labels:

facet_grid(cols = vars(b), labeller = label_both)

Display math symbols in the labels:

facet_grid(cols = vars(b), labeller = label_bquote(cols = alpha ^ .(b)))
facet_grid(cols = vars(b), labeller = label_parsed)

Example Facet Labels

Example 1
Example 2

Including the variable name and the facet value using label_both:

Code

ggplot(data = sm_tx,
       aes(x = date, y = median)) + 
  geom_point() + 
  facet_grid(cols = vars(city),
             labeller = label_both) +
  geom_smooth(method = "loess") + 
  labs(x = "Date",
       y = "Median Home Price",
       title = "Texas Housing Prices")

Including math labels in facet names using label_bquote:

Code

ggplot(data = sm_tx,
       aes(x = date, y = median)) + 
  geom_point() + 
  facet_grid(cols = vars(city),
             labeller = label_bquote(cols = .(city)^2)) +
  geom_smooth(method = "loess") + 
  labs(x = "Date",
       y = "Median Home Price",
       title = "Texas Housing Prices")

Statistical Transformation: `stat`

A stat transforms an existing variable into a new variable to plot.

identity leaves the data as is.
count counts the number of observations.
summary allows you to specify a desired transformation function.

Sometimes these statistical transformations happen under the hood when we call a geom.

Statistical Transformation: `stat`

stat_count()
stat_summary()

ggplot(data = mpg,
       mapping = aes(x = class)) +
  geom_bar()

ggplot(data = mpg,
       mapping = aes(x = class)) +
  stat_count(geom = "bar")

ggplot(data = mpg,
       mapping = aes(x = class,
                     y = hwy)) +
  stat_summary(geom = "bar",
               fun = "mean") +
  scale_y_continuous(limits = c(0,45))

ggplot(data = mpg,
       mapping = aes(x = class,
                     y = hwy)) +
  stat_summary(geom = "bar",
               fun = "max") +
  scale_y_continuous(limits = c(0,45))
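
To peek under the hood, layer_data() returns the data a layer actually draws. This sketch checks that the bar heights computed by stat_count() match counts tallied by hand.

```r
library(ggplot2)

bar_plot <- ggplot(data = mpg,
                   mapping = aes(x = class)) +
  geom_bar()

# layer_data() gives the data frame behind the first layer,
# including the count column computed by stat_count().
computed <- layer_data(bar_plot)
computed[, c("x", "count")]

# The same counts, tallied by hand:
table(mpg$class)
```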

Position Adjustments

Position adjustments determine how to arrange geoms that would otherwise occupy the same space.

position = 'dodge': Arrange elements side by side.

position = 'fill': Stack elements on top of one another + normalize height.

position = 'stack': Stack elements on top of one another.

position = 'jitter': Add random noise to X & Y position of each element to avoid overplotting (see geom_jitter()).

Position Adjustments

ggplot(mpg, aes(x = fl, fill = drv)) + 
  geom_bar(position = "_____")
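
Filling in the blank with the three bar-friendly options gives the sketch below (the object names are made up for illustration):

```r
library(ggplot2)

fuel_by_drive <- ggplot(mpg, aes(x = fl, fill = drv))

side_by_side <- fuel_by_drive + geom_bar(position = "dodge")  # bars next to each other
proportions  <- fuel_by_drive + geom_bar(position = "fill")   # stacked, every bar has height 1
stacked      <- fuel_by_drive + geom_bar(position = "stack")  # stacked counts (geom_bar() default)

side_by_side
proportions
stacked
```

With "fill", the y-axis shows proportions rather than counts, which makes it easier to compare the drive-type mix across fuel types with very different totals.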

Plot Customizations

Labels
Themes
Scales: Axes Ticks
Scales: Color

Code

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x = "Engine Displacement (liters)", 
       y = "Highway MPG", 
       color = "Number of \nCylinders",
       title = "Car Efficiency")

Code

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x = "Engine Displacement (liters)", 
       y = "Highway MPG", 
       color = "Number of \nCylinders",
       title = "Car Efficiency") +
  theme_bw() +
  theme(legend.position = "bottom")

Code

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x     = "Engine Displacement (liters)",
       color = "Number of \nCylinders",
       title = "Car Efficiency") +
  scale_y_continuous(name = "Highway MPG", 
                     limits = c(0, 50),
                     breaks = seq(from = 0, to = 50, by = 5))

Code

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x    = "Engine Displacement (liters)",
       y    = "Highway MPG",
       color = "Number of \nCylinders",
       title = "Car Efficiency") +
  scale_color_gradient(low = "white", high = "green4")

Formatting your Plot Code

It is good practice to put each geom and aes on a new line.

This makes code easier to read!
Generally: no line of code should be over 80 characters long.

Bad Practice
Good Practice
Somewhere In Between

ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) + geom_point() + theme_bw() + labs(x = "City (mpg)", y = "Highway (mpg)")

ggplot(data = mpg, 
       mapping = aes(x = cty, 
                     y = hwy, 
                     color = class)) + 
  geom_point() + 
  theme_bw() + 
  labs(x = "City (mpg)", 
       y = "Highway (mpg)")

ggplot(data = mpg, 
       mapping = aes(x = cty, y = hwy, color = class)) + 
  geom_point() + 
  theme_bw() + 
  labs(x = "City (mpg)", y = "Highway (mpg)")

Let’s Practice!

How would you make this plot from the diamonds dataset in ggplot2?

data
aes
geom
facet

Creating a Game Plan

There are a lot of pieces to put together when creating a good graphic.

So, when sitting down to create a plot, you should first create a game plan!

This game plan should include:

What data are you starting from?
What are your x- and y-axes?
What type(s) of geom do you need?
What other aesthetics do you need?

Make a Game Plan!
Example
R Code - Baseline Plot
R Code - Formatted Plot

Use the mpg dataset to create two side-by-side scatterplots of city MPG vs. highway MPG where the points are colored by the drive type (drv). The two plots should be separated by year.

Code

ggplot(data = mpg,
       mapping = aes(x = cty,
                     y = hwy,
                     color = drv)) +
  geom_point() +
  facet_grid(cols = vars(year))

Code

ggplot(data = mpg,
       mapping = aes(x = cty,
                     y = hwy,
                     color = drv)) +
  geom_point() +
  facet_grid(cols = vars(year)) +
  labs(x = "city MPG",
       y = "highway MPG") +
  scale_color_discrete(name = "drive type",
                       labels = c("4-wheel", "front", "rear"))

PA 2: Using Data Visualization to Find the Penguins

Artwork by Allison Horst

To do…

PA 2: Using Data Visualization to Find the Penguins
- Due Thursday (4/10) before class
Lab 2: Exploring Rodents with ggplot2
- Due Monday (4/14) at 11:59 pm
