Importing Data and Graphics with ggplot2

Tuesday, April 8

Today we will…

  • Style Note of the Day
  • New Material
    • Welcome to the Tidyverse
    • Load External Data
    • Graphics (and ggplot2)
    • Game Planning
  • PA 2: Using Data Visualization to Find the Penguins

Style Note of the Day - Function Calls

Tip

Name arguments in function calls

Only include necessary arguments! (If you are using any default values, no need to repeat them in your function call.)

Good

mean(1:10, na.rm = TRUE)
seq(from = 1, to = 100, by = 5)

Bad

mean(1:10, , TRUE)
mean(1:10, trim = 0, na.rm = TRUE)

seq(1, 100, 5)
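A quick check of the tip above, as a sketch in base R: naming arguments does not change the result, and restating a default adds nothing. The variable names are illustrative only.

```r
# Named and positional calls produce the same sequence;
# the named version documents what each value means.
named      <- seq(from = 1, to = 100, by = 5)
positional <- seq(1, 100, 5)
identical(named, positional)

# trim = 0 is already the default for mean(), so repeating it is redundant.
with_default    <- mean(1:10, trim = 0, na.rm = TRUE)
without_default <- mean(1:10, na.rm = TRUE)
identical(with_default, without_default)
```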

Welcome to the Tidyverse

Tidywho?

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

  • Most of the functionality you will need for an entire data analysis workflow with cohesive grammar

Core Packages

The tidyverse includes functions to:

Read in data: readr
Visualize data: ggplot2
Manipulate rectangular data: tidyr, dplyr, tibble
Handle special variable types: stringr, forcats, lubridate
Support functional programming: purrr

Tidyverse and STAT 331

  • This version of the course will primarily use tidyverse packages and grammar

  • Reasoning:

    • the tidyverse is as reputable and ubiquitous as base R at this point (in my opinion)
    • the tidyverse is specifically designed to help programmers produce easy-to-read and reproducible analyses and to reduce errors
    • there is excellent documentation!
    • I like it!

Using the tidyverse package

  • Installing the tidyverse package installs all of the “tidyverse” packages; loading it with library(tidyverse) attaches the core packages listed above

  • Avoid redundantly installing or loading packages!

Do this:

library(tidyverse)

or

library(readr)

Not this:

library(tidyverse)
library(readr)

Tidy Data

Artwork by Allison Horst

Working with External Data

Data Science Workflow

Common Types of Data Files

Look at the file extension for the type of data file.

.csv : “comma-separated values”

Name, Age
Bob, 49
Joe, 40

.xls, .xlsx: Microsoft Excel spreadsheet

  • Common approach: save as .csv
  • Nicer approach: use the readxl package

.txt: plain text

  • Could have any sort of delimiter…
  • Need to let R know what to look for!

Loading External Data

Using base R functions:

  • read.csv() is for reading in .csv files.

  • read.table() and read.delim() are for any data with “columns” (you specify the separator).
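A minimal sketch of the two base R approaches. Inline text stands in for a file here (the text argument is part of read.table() and is passed through by read.csv()); with a real file you would give its path instead. The data reuse the Name/Age example and the object names are illustrative.

```r
# Comma-separated: read.csv() already assumes sep = "," and header = TRUE.
csv_text <- "Name,Age\nBob,49\nJoe,40"
people_csv <- read.csv(text = csv_text)

# Any other delimiter: read.table() must be told the separator and
# that the first row holds column names.
semi_text <- "Name;Age\nBob;49\nJoe;40"
people_semi <- read.table(text = semi_text, sep = ";", header = TRUE)

str(people_csv)
str(people_semi)
```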

Loading External Data

The tidyverse has some cleaned-up versions in the readr and readxl packages:

  • read_csv() is for comma-separated data.

  • read_tsv() is for tab-separated data.

  • read_table() is for white-space-separated data.

  • read_delim() is for any data with “columns” (you specify the separator). The above are special cases.

  • read_excel() is specifically for dealing with Excel files.

Remember to load the readr and readxl packages first!
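A small sketch of the readr versions, assuming readr 2.0 or later (for show_col_types). The files are written to a temporary location so the example is self-contained; the file contents and object names are made up for illustration.

```r
library(readr)

# Write a tiny pipe-delimited file (illustrative data).
tmp_txt <- tempfile(fileext = ".txt")
writeLines(c("Name|Age", "Bob|49", "Joe|40"), tmp_txt)

# read_delim() needs to be told which delimiter to look for.
people <- read_delim(tmp_txt, delim = "|", show_col_types = FALSE)
people

# read_csv() is the special case where the delimiter is a comma.
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("Name,Age", "Bob,49", "Joe,40"), tmp_csv)
people_csv <- read_csv(tmp_csv, show_col_types = FALSE)
people_csv
```

Both calls return a tibble rather than a base data frame.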

Take a look at the documentation

Reminder: Notebooks and File Paths

  • You have to tell R where to “find” the data you want to read in using a file path.

  • Quarto automatically sets the working directory to the directory containing the Quarto document for any code within that document

  • This overrides the directory set by an .Rproj

  • Pay attention to this when setting relative filepaths

    • To “back out” of one directory, use "../"
    • e.g.: "../data/dat.csv"
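A short sketch for checking a relative path before reading from it. The folder and file names are the illustrative ones from the bullet above, so file.exists() will likely return FALSE on your machine.

```r
# Where is R currently looking? Inside a Quarto document this is
# the folder that contains the document.
getwd()

# Build the relative path piece by piece; ".." backs out one directory.
data_path <- file.path("..", "data", "dat.csv")
data_path

# Confirm the file is really there before trying to read it.
file.exists(data_path)
```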

Grammar of Graphics

Why Do We Create Graphics?

Grammar of Graphics

The Grammar of Graphics (GoG) is a principled way of specifying exactly how to create a particular graph from a given data set. It helps us to systematically design new graphs.


Think of a graph or a data visualization as a mapping…

  • FROM variables in the data set (or statistics computed from the data)…

  • TO visual attributes (or “aesthetics”) of marks (or “geometric elements”) on the page/screen.

Why Grammar of Graphics?

  • It’s more flexible than a “chart zoo” of named graphs.
  • The software understands the structure of your graph.
  • It easily automates graphing of data subsets.

Components of Grammar of Graphics

  • data: dataframe containing variables
  • aes : aesthetic mappings (position, color, symbol, …)
  • geom : geometric element (point, line, bar, box, …)
  • stat : statistical variable transformation (identity, count, linear model, quantile, …)
  • scale : scale transformation (log scale, color mapping, axes tick breaks, …)
  • coord : Cartesian, polar, map projection, …
  • facet : divide into subplots using a categorical variable
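To make the components concrete, here is one plot with each piece labeled. This is only a sketch using the mpg data that ships with ggplot2; the particular variables and scale are arbitrary choices for illustration.

```r
library(ggplot2)

p <- ggplot(data = mpg,                     # data
            mapping = aes(x = displ,        # aes: x position
                          y = hwy,          # aes: y position
                          color = drv)) +   # aes: color
  geom_point(stat = "identity") +           # geom, with its (default) stat
  scale_x_log10() +                         # scale: log-transform the x-axis
  coord_cartesian() +                       # coord: the default Cartesian system
  facet_wrap(vars(year))                    # facet: one panel per year
p
```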

Using ggplot2

How to Build a Graphic

Complete this template to build a basic graphic:


  • We use + to add layers to a graphic.

This begins a plot that you can add layers to:

ggplot(data = mpg)

ggplot(data = mpg, 
       aes(x = class, y = hwy))

ggplot(data = mpg, 
       aes(x = class, y = hwy)) +
  geom_jitter()

ggplot(data = mpg, 
       aes(x = class, y = hwy)) +
  geom_jitter() +
  geom_boxplot()

How would you make the points be on top of the boxplots?
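One possible answer, as a sketch: layers are drawn in the order they are added, so adding geom_jitter() after geom_boxplot() puts the points on top.

```r
library(ggplot2)

# The boxplots are drawn first, then the jittered points over them.
p_points_on_top <- ggplot(data = mpg,
                          aes(x = class, y = hwy)) +
  geom_boxplot() +
  geom_jitter()
p_points_on_top
```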

Aesthetics

We map variables (columns) from the data to aesthetics on the graphic using the aes() function.

What aesthetics can we set (see ggplot2 cheat sheet for more)?

  • x, y
  • color, fill
  • linetype
  • lineend
  • size
  • shape

Global v. Local Aesthetics

Global Aesthetics

ggplot(data = mpg, 
       mapping = aes(x = class, 
                     y = hwy)) +
  geom_boxplot()

Local Aesthetics

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, 
                             y = hwy))
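The two versions above draw the same plot because there is only one layer. The difference shows up with several layers, since global aesthetics are inherited by every layer and local ones are not. A sketch, with variable choices that are illustrative only:

```r
library(ggplot2)

# color is local to geom_point(), so geom_smooth() does not inherit it
# and fits a single trend line through all of the points.
p_local <- ggplot(data = mpg,
                  mapping = aes(x = cty, y = hwy)) +
  geom_point(mapping = aes(color = drv)) +
  geom_smooth(method = "lm")

# color is global, so both layers use it: one trend line per drive type.
p_global <- ggplot(data = mpg,
                   mapping = aes(x = cty, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = "lm")
```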

Mapping v. Setting Aesthetics

Mapping Aesthetics

ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = class,
                            y = hwy,
                            color = class))

Setting Aesthetics

ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = class,
                            y = hwy),
              color = "steelblue")
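A common slip is to put a constant inside aes(). The sketch below contrasts the two; the object names are illustrative.

```r
library(ggplot2)

# Pitfall: inside aes(), "steelblue" is treated as data (a variable with one
# value), so the points get the default palette color and a legend appears.
p_constant_in_aes <- ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = class,
                            y = hwy,
                            color = "steelblue"))

# Setting: outside aes(), the value is used directly as the point color.
p_set <- ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = class,
                            y = hwy),
              color = "steelblue")
```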

Geometric Objects

We use a geom_xxx() function to represent data points.

one variable

  • geom_density()
  • geom_dotplot()
  • geom_histogram()
  • geom_boxplot()

two variables

  • geom_point()
  • geom_line()
  • geom_density_2d()

three variables

  • geom_contour()
  • geom_raster()

Not an exhaustive list – see ggplot2 cheat sheet.

ggplot(data = mpg,
       aes(x = cty,
           y = hwy,
           color = class)) +
  geom_point() +
  labs(x = "City (mpg)", y = "Highway (mpg)") +
  theme(axis.title = element_text(size = 14),
        legend.title = element_blank(),
        legend.text = element_text(size = 14))

ggplot(data = mpg,
       aes(x = cty,
           y = hwy,
           color = class)) +
  geom_text(aes(label = class)) +
  labs(x = "City (mpg)", y = "Highway (mpg)") +
  theme(axis.title = element_text(size = 14),
        legend.title = element_blank(),
        legend.text = element_text(size = 14))

ggplot(data = mpg,
       aes(x = cty,
           y = hwy,
           color = class)) +
  geom_line() +
  labs(x = "City (mpg)", y = "Highway (mpg)") +
  theme(axis.title = element_text(size = 14),
        legend.title = element_blank(),
        legend.text = element_text(size = 14))

Creating a Graphic

To create a specific type of graphic, we will combine aesthetics and geometric objects.


Let’s try it!

Game Planning

What: Game Plans! are strategic guides that prompt you to map your coding strategies before implementation.

How: Your favorite sketch app, paper + pencil, online whiteboard (Excalidraw!).

Why: Tool to connect data and desired graphic before you start coding

Start with the TX housing data.
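The plotting code in this section uses a data frame called sm_tx, which is not defined on these slides. A minimal sketch of how such an object could be built, assuming it is a subset of the txhousing data that ships with ggplot2; the three cities are placeholders and may not match the ones used in class.

```r
library(ggplot2)
library(dplyr)

# Assumption: sm_tx is txhousing restricted to a handful of cities.
# The city list here is hypothetical.
sm_tx <- txhousing |>
  filter(city %in% c("Austin", "Dallas", "Houston"))

# Check which cities remain and which columns are available
# (date and median are the ones used in the plots).
distinct(sm_tx, city)
names(sm_tx)
```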

Make a plot of median house price over time (including both individual data points and a smoothed trend line), distinguishing between different cities.

ggplot(data = sm_tx,
       aes(x = date, y = median, color = city)) + 
  geom_point() + 
  geom_smooth(method = "loess") + 
  labs(x = "Date",
       y = "Median Home Price",
       title = "Texas Housing Prices")

Faceting

Extracts subsets of data and places them in side-by-side plots.

ggplot(data = sm_tx,
       aes(x = date, y = median)) + 
  geom_point() + 
  facet_grid(cols = vars(city)) +
  geom_smooth(method = "loess") + 
  labs(x = "Date",
       y = "Median Home Price",
       title = "Texas Housing Prices")

ggplot(data = sm_tx,
       aes(x = date, y = median)) + 
  geom_point() + 
  facet_wrap(vars(city)) +
  geom_smooth(method = "loess") + 
  labs(x = "Date",
       y = "Median Home Price",
       title = "Texas Housing Prices")

Faceting

Extracts subsets of data and places them in side-by-side plots.

  • facet_grid(cols = vars(b)): facet into columns based on b
  • facet_grid(rows = vars(a)): facet into rows based on a
  • facet_grid(rows = vars(a), cols = vars(b)): facet into both rows and columns
  • facet_wrap(vars(b)): wrap facets into a rectangular layout

You can set scales to let axis limits vary across facets:

facet_grid(rows = vars(a),
           cols = vars(b),
           scales = ______)

  • "fixed" – default, x- and y-axis limits are the same for each facet
  • "free" – both x- and y-axis limits adjust to individual facets
  • "free_x" – only x-axis limits adjust
  • "free_y" – only y-axis limits adjust
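As a sketch of one of these options, the plot below lets the y-axis limits differ between rows of facets while the x-axis stays shared. The variables come from mpg and are chosen only for illustration.

```r
library(ggplot2)

p_free_y <- ggplot(data = mpg,
                   mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  facet_grid(rows = vars(drv),
             cols = vars(year),
             scales = "free_y")   # y limits adjust per row of facets
p_free_y
```

Free scales make each panel easier to read on its own, at the cost of making panels harder to compare with one another.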

You can set a labeller to adjust facet labels.

Include both the variable name and the factor level in the labels:

  • facet_grid(cols = vars(b), labeller = label_both)

Display math symbols in the labels:

  • facet_grid(cols = vars(b), labeller = label_bquote(cols = alpha ^ .(b)))
  • facet_grid(cols = vars(b), labeller = label_parsed)

Example Facet Labels

Including both the variable name and its value in each facet label using label_both:

ggplot(data = sm_tx,
       aes(x = date, y = median)) + 
  geom_point() + 
  facet_grid(cols = vars(city),
             labeller = label_both) +
  geom_smooth(method = "loess") + 
  labs(x = "Date",
       y = "Median Home Price",
       title = "Texas Housing Prices")

Including math labels in facet names using label_bquote:

ggplot(data = sm_tx,
       aes(x = date, y = median)) + 
  geom_point() + 
  facet_grid(cols = vars(city),
             labeller = label_bquote(cols = .(city)^2)) +
  geom_smooth(method = "loess") + 
  labs(x = "Date",
       y = "Median Home Price",
       title = "Texas Housing Prices")

Statistical Transformation: stat

A stat transforms an existing variable into a new variable to plot.

  • identity leaves the data as is.
  • count counts the number of observations.
  • summary allows you to specify a desired transformation function.

Sometimes these statistical transformations happen under the hood when we call a geom.

Statistical Transformation: stat

ggplot(data = mpg,
       mapping = aes(x = class)) +
  geom_bar()

ggplot(data = mpg,
       mapping = aes(x = class)) +
  stat_count(geom = "bar")

ggplot(data = mpg,
       mapping = aes(x = class,
                     y = hwy)) +
  stat_summary(geom = "bar",
               fun = "mean") +
  scale_y_continuous(limits = c(0,45))

ggplot(data = mpg,
       mapping = aes(x = class,
                     y = hwy)) +
  stat_summary(geom = "bar",
               fun = "max") +
  scale_y_continuous(limits = c(0,45))
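For contrast with the count and summary stats, here is a sketch of the identity stat: the counts are computed ahead of time with dplyr's count(), and the bars are drawn from those values as is. The object names are illustrative.

```r
library(ggplot2)
library(dplyr)

# Do the counting ourselves: one row per class, with the count in column n.
class_counts <- mpg |>
  count(class)

# stat = "identity" plots the y values exactly as they appear in the data.
p_identity <- ggplot(data = class_counts,
                     mapping = aes(x = class, y = n)) +
  geom_bar(stat = "identity")
p_identity
```

geom_col() is shorthand for geom_bar(stat = "identity").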

Position Adjustments

Position adjustments determine how to arrange geoms that would otherwise occupy the same space.

  • position = 'dodge': Arrange elements side by side.
  • position = 'fill': Stack elements on top of one another + normalize height.
  • position = 'stack': Stack elements on top of one another.
  • position = 'jitter': Add random noise to X & Y position of each element to avoid overplotting (see geom_jitter()).

Position Adjustments

ggplot(mpg, aes(x = fl, fill = drv)) + 
  geom_bar(position = "_____")
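Two of the options filled in, as a sketch for comparison. The y-axis label on the second plot is an added suggestion, since "fill" changes the axis from counts to proportions.

```r
library(ggplot2)

# Side-by-side bars: compare counts of each drive type within a fuel type.
p_dodge <- ggplot(mpg, aes(x = fl, fill = drv)) +
  geom_bar(position = "dodge")

# Stacked bars scaled to height 1: compare proportions across fuel types.
p_fill <- ggplot(mpg, aes(x = fl, fill = drv)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion")
```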

Plot Customizations

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x = "Engine Displacement (liters)", 
       y = "Highway MPG", 
       color = "Number of \nCylinders",
       title = "Car Efficiency")

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x = "Engine Displacement (liters)", 
       y = "Highway MPG", 
       color = "Number of \nCylinders",
       title = "Car Efficiency") +
  theme_bw() +
  theme(legend.position = "bottom")

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x     = "Engine Displacement (liters)",
       color = "Number of \nCylinders",
       title = "Car Efficiency") +
  scale_y_continuous("Highway MPG", 
                     limits = c(0,50),
                     breaks = seq(0,50,5))

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x    = "Engine Displacement (liters)",
       y    = "Highway MPG",
       color = "Number of \nCylinders",
       title = "Car Efficiency") +
  scale_color_gradient(low = "white", high = "green4")

Formatting your Plot Code

It is good practice to put each geom and aes on a new line.

  • This makes code easier to read!
  • Generally: no line of code should be over 80 characters long.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) + geom_point() + theme_bw() + labs(x = "City (mpg)", y = "Highway (mpg)")

ggplot(data = mpg, 
       mapping = aes(x = cty, 
                     y = hwy, 
                     color = class)) + 
  geom_point() + 
  theme_bw() + 
  labs(x = "City (mpg)", 
       y = "Highway (mpg)")

ggplot(data = mpg, 
       mapping = aes(x = cty, y = hwy, color = class)) + 
  geom_point() + 
  theme_bw() + 
  labs(x = "City (mpg)", y = "Highway (mpg)")

Let’s Practice!

How would you make this plot from the diamonds dataset in ggplot2?


  • data
  • aes
  • geom
  • facet

Creating a Game Plan

There are a lot of pieces to put together when creating a good graphic.

  • So, when sitting down to create a plot, you should first create a game plan!

This game plan should include:

  1. What data are you starting from?
  2. What are your x- and y-axes?
  3. What type(s) of geom do you need?
  4. What other aesthetics do you need?

Use the mpg dataset to create two side-by-side scatterplots of city MPG vs. highway MPG where the points are colored by the drive type (drv). The two plots should be separated by year.

ggplot(data = mpg,
       mapping = aes(x = cty,
                     y = hwy,
                     color = drv)) +
  geom_point() +
  facet_grid(cols = vars(year))

ggplot(data = mpg,
       mapping = aes(x = cty,
                     y = hwy,
                     color = drv)) +
  geom_point() +
  facet_grid(cols = vars(year)) +
  labs(x = "city MPG",
       y = "highway MPG") +
  scale_color_discrete(name = "drive type",
                      labels = c("4-wheel","front","rear"))

PA 2: Using Data Visualization to Find the Penguins

Artwork by Allison Horst

To do…

  • PA 2: Using Data Visualization to Find the Penguins
    • Due Thursday (4/10) before class
  • Lab 2: Exploring Rodents with ggplot2
    • Due Monday (4/14) at 11:59 pm