ggplot2
Today we will…
ggplot2
Tip
Name arguments in function calls
Only include necessary arguments! (If you are using any default values, no need to repeat them in your function call.)
Good
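As a minimal sketch of this tip (the `mpg` data set ships with ggplot2; the choice of variables and of `alpha = 0.5` is purely illustrative), a call that names its arguments and leaves defaults alone might look like the first plot below. The second plot shows the style the tip warns against.

```r
library(ggplot2)

# Arguments are named, and only non-default values are supplied.
p_good <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5)

# Harder to read: the unnamed arguments rely on position, and
# stat/position only restate geom_point()'s defaults.
p_verbose <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(alpha = 0.5, stat = "identity", position = "identity")
```

Both objects draw the same scatterplot; only the readability of the call differs.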
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.1
The tidyverse includes functions to:
| Task | Package(s) |
|---|---|
| Read in data | readr |
| Visualize data | ggplot2 |
| Manipulate rectangular data | tidyr, dplyr, tibble |
| Handle special variable types | stringr, forcats, lubridate |
| Support functional programming | purrr |
This version of the course will primarily use tidyverse packages and grammar.
Reasoning:
R
at this point (in my opinion)
tidyverse package
Installing/loading the tidyverse package installs/loads all of the “tidyverse” packages.
Avoid redundantly installing or loading packages!
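A minimal sketch of what this looks like in practice, assuming the tidyverse package is already installed (the install line is left commented out for that reason):

```r
# Install once per machine (commented out so it is not re-run on every render):
# install.packages("tidyverse")

# One call attaches the core packages (ggplot2, dplyr, tidyr, readr, ...),
# so separate library(ggplot2) or library(dplyr) calls would be redundant.
library(tidyverse)
```

Note that readxl is installed with the tidyverse but is not attached by `library(tidyverse)`, so it still needs its own `library(readxl)` call.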
Artwork by Allison Horst
Look at the file extension for the type of data file.
.csv: “comma-separated values”
Name, Age
Bob, 49
Joe, 40
.xls, .xlsx: Microsoft Excel spreadsheet
.csv
readxl package
.txt: plain text
Using base R functions:
read.csv() is for reading in .csv files.
read.table() and read.delim() are for any data with “columns” (you specify the separator).
The tidyverse has some cleaned-up versions in the readr and readxl packages:
read_csv() is for comma-separated data.
read_tsv() is for tab-separated data.
read_table() is for white-space-separated data.
read_delim() is for any data with “columns” (you specify the separator). The above are special cases.
read_excel() is specifically for dealing with Excel files.
Remember to load the readr and readxl packages first!
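To make the relationship between these functions concrete, here is a small sketch. It writes the three-line `Name, Age` example to a temporary file so that it runs anywhere; the temporary file and the object names are illustrative assumptions, and the Excel path is a placeholder, so that call is left commented out.

```r
library(readr)

# Write a tiny CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Age", "Bob,49", "Joe,40"), csv_path)

# read_csv() is the comma-separated special case...
people <- read_csv(csv_path, show_col_types = FALSE)

# ...of read_delim(), where you specify the separator yourself.
people_delim <- read_delim(csv_path, delim = ",", show_col_types = FALSE)

# Excel files need readxl; the path below is a placeholder, so the call
# is left commented out:
# library(readxl)
# budget <- read_excel("path/to/file.xlsx")
```

Both `people` and `people_delim` are tibbles with a character column `Name` and a numeric column `Age`.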
You have to tell R where to “find” the data you want to read in using a file path.
Quarto automatically sets the working directory to be the directory where the Quarto document is, for any code within the Quarto document.
This overrides the directory set by an .Rproj.
Pay attention to this when setting relative file paths.
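One way to check how a relative path will be resolved before reading a file is sketched below. It uses only base R, and the path `../data/dat.csv` is an example path that may not exist on your machine.

```r
# The working directory that relative paths are resolved against.
getwd()

# "../" means "go up one directory" from the working directory.
rel_path <- file.path("..", "data", "dat.csv")
rel_path

# See where the relative path points, and whether the file is there,
# before trying to read it.
normalizePath(rel_path, mustWork = FALSE)
file.exists(rel_path)
```

Running this inside a Quarto document and again in the console can give different answers, because the two may have different working directories.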
"../"
"../data/dat.csv"
The Grammar of Graphics (GoG) is a principled way of specifying exactly how to create a particular graph from a given data set. It helps us to systematically design new graphs.
Think of a graph or a data visualization as a mapping…
FROM variables in the data set (or statistics computed from the data)…
TO visual attributes (or “aesthetics”) of marks (or “geometric elements”) on the page/screen.
data: dataframe containing variables
aes: aesthetic mappings (position, color, symbol, …)
geom: geometric element (point, line, bar, box, …)
stat: statistical variable transformation (identity, count, linear model, quantile, …)
scale: scale transformation (log scale, color mapping, axes tick breaks, …)
coord: Cartesian, polar, map projection, …
facet: divide into subplots using a categorical variable
ggplot2
Complete this template to build a basic graphic:
Use + to add layers to a graphic.
We map variables (columns) from the data to aesthetics on the graphic using the aes() function.
What aesthetics can we set (see ggplot2 cheat sheet for more)?
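As a concrete illustration of data, aesthetic mappings, and a geom layer working together, here is a minimal sketch using the `mpg` data set that ships with ggplot2. The choice of variables is an assumption made for illustration only.

```r
library(ggplot2)

# data: the mpg data frame
# aes(): engine displacement -> x position, highway mpg -> y position,
#        vehicle class -> point color
# geom: points, added as a layer with +
p <- ggplot(data = mpg,
            mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

p
```

Other commonly mapped aesthetics include `fill`, `size`, `shape`, `alpha`, and `linetype`.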
Global Aesthetics
Mapping Aesthetics
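The sketch below reads these two headings as follows; that reading is an assumption. "Global" aesthetics are declared in `ggplot()` and inherited by every layer, while aesthetics declared inside one geom apply to that layer only. "Mapping" means tying an aesthetic to a variable inside `aes()`, as opposed to setting it to a constant outside `aes()`.

```r
library(ggplot2)

# Global: aes() inside ggplot() is inherited by every layer.
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# Local: aes() inside a geom applies to that layer only, so here
# only the points are colored by drive type.
p_local <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = drv)) +
  geom_smooth()

# Setting (not mapping): a constant outside aes() applies to all points.
p_set <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "steelblue")
```

In `p_local` there is a single trend line, because the smoother does not see the `color = drv` mapping.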
We use a geom_xxx() function to represent data points.
one variable
geom_density()
geom_dotplot()
geom_histogram()
geom_boxplot()
two variable
geom_point()
geom_line()
geom_density_2d()
three variable
geom_contour()
geom_raster()
Not an exhaustive list – see ggplot2 cheat sheet.
To create a specific type of graphic, we will combine aesthetics and geometric objects.
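For instance, one geom from each group above could be combined with matching aesthetics as sketched here. The `mpg` and `faithfuld` data sets both ship with ggplot2; the variables and the `binwidth` value are illustrative choices.

```r
library(ggplot2)

# One variable: distribution of highway mpg.
p_hist <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# Two variables: highway mpg against engine displacement.
p_point <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Three variables: eruption density over a grid of waiting and eruption times.
p_raster <- ggplot(data = faithfuld,
                   mapping = aes(x = waiting, y = eruptions, fill = density)) +
  geom_raster()
```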
Let’s try it!
What: Game Plans! are strategic guides that prompt you to map your coding strategies before implementation.
How: Your favorite sketch app, paper + pencil, online whiteboard (Excalidraw!).
Why: Tool to connect data and desired graphic before you start coding
Start with the TX housing data.
Make a plot of median house price over time (including both individual data points and a smoothed trend line), distinguishing between different cities.
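One possible translation of this game plan into code is sketched below; it is not the only reasonable answer. It assumes the `txhousing` data set that ships with ggplot2, and the transparency value and axis labels are illustrative.

```r
library(ggplot2)

# In txhousing, `date` is a decimal year, `median` is the median sale
# price, and `city` identifies the location.
p_tx <- ggplot(data = txhousing,
               mapping = aes(x = date, y = median, color = city)) +
  geom_point(alpha = 0.3, na.rm = TRUE) +    # individual data points
  geom_smooth(se = FALSE, na.rm = TRUE) +    # smoothed trend line per city
  labs(x = "Date", y = "Median sale price")
```

Because `color = city` is mapped globally, both the points and the trend lines are split by city. With this many cities the legend is large, so filtering to a few cities first may give a more readable plot.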
Faceting extracts subsets of data and places them in side-by-side plots.
facet_grid(cols = vars(b)): facet into columns based on b
facet_grid(rows = vars(a)): facet into rows based on a
facet_grid(rows = vars(a), cols = vars(b)): facet into both rows and columns
facet_wrap(vars(b)): wrap facets into a rectangular layout
You can set scales to let axis limits vary across facets:
"fixed" – default, x- and y-axis limits are the same for each facet
"free" – both x- and y-axis limits adjust to individual facets
"free_x" – only x-axis limits adjust
"free_y" – only y-axis limits adjust
You can set a labeller to adjust facet labels.
Include both the variable name and factor name in the labels:
facet_grid(cols = vars(b), labeller = label_both)
Display math symbols in the labels:
facet_grid(cols = vars(b), labeller = label_bquote(cols = alpha ^ .(b)))
facet_grid(cols = vars(b), labeller = label_parsed)
Including the variable and facet names using label_both:
Including math labels in facet names using label_bquote:
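The facet layouts, `scales` settings, and labellers described above can be sketched on the `mpg` data set, which ships with ggplot2. The faceting variables (`year`, `drv`, `class`, `cyl`) stand in for the generic `a` and `b` and are illustrative choices.

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Rows by year, columns by drive type.
p_grid <- base + facet_grid(rows = vars(year), cols = vars(drv))

# Wrapped layout; only the y-axis limits vary between panels.
p_wrap <- base + facet_wrap(vars(class), scales = "free_y")

# label_both: strip text shows "variable: value", e.g. "drv: 4".
p_both <- base + facet_grid(cols = vars(drv), labeller = label_both)

# label_bquote: strip text is a plotmath expression; .(cyl) is replaced
# by each facet's value, giving alpha raised to that value.
p_math <- base +
  facet_grid(cols = vars(cyl), labeller = label_bquote(cols = alpha ^ .(cyl)))
```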
stat
A stat transforms an existing variable into a new variable to plot.
identity leaves the data as is.
count counts the number of observations.
summary allows you to specify a desired transformation function.
Sometimes these statistical transformations happen under the hood when we call a geom.
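The three stats named above can be compared side by side in a short sketch. It uses the `mpg` data set from ggplot2; the helper table `class_counts` is built here only for illustration.

```r
library(ggplot2)

# geom_bar() uses stat = "count" under the hood: we only supply x,
# and the bar heights are the number of rows in each class.
p_count <- ggplot(data = mpg, mapping = aes(x = class)) +
  geom_bar()

# stat = "identity" plots the values as they are, so the heights
# must already be in the data.
class_counts <- as.data.frame(table(class = mpg$class))
p_identity <- ggplot(data = class_counts, mapping = aes(x = class, y = Freq)) +
  geom_bar(stat = "identity")

# stat_summary() applies a function you choose, here the mean highway
# mpg within each class.
p_summary <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  stat_summary(fun = mean, geom = "point")
```

`p_count` and `p_identity` draw the same bars; the difference is whether ggplot2 or you did the counting.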
Position adjustments determine how to arrange geoms that would otherwise occupy the same space.
position = 'dodge': Arrange elements side by side.
position = 'fill': Stack elements on top of one another + normalize height.
position = 'stack': Stack elements on top of one another.
position = 'jitter': Add random noise to X & Y position of each element to avoid overplotting (see geom_jitter()).
It is good practice to put each geom and aes on a new line.
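A brief sketch of the four position adjustments, again on the `mpg` data set from ggplot2 (the variables are illustrative choices), with each geom on its own line:

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = class, fill = drv))

p_stack <- base + geom_bar(position = "stack")  # default for geom_bar()
p_dodge <- base + geom_bar(position = "dodge")  # side by side
p_fill  <- base + geom_bar(position = "fill")   # stacked, heights scaled to 1

# Jitter adds random noise so overlapping points become visible;
# geom_jitter() is shorthand for geom_point(position = "jitter").
p_jitter <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(position = "jitter")
```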
How would you make this plot from the diamonds dataset in ggplot2?
data
aes
geom
facet
There are a lot of pieces to put together when creating a good graphic.
This game plan should include:
What geom do you need?
What aes’s do you need?
Use the mpg dataset to create two side-by-side scatterplots of city MPG vs. highway MPG where the points are colored by the drive type (drv). The two plots should be separated by year.
Artwork by Allison Horst
ggplot2