Writing Data Functions

Monday, May 18

Today we will…

  • Group Quiz 7
  • New Material
    • Calling Functions on Datasets
    • Tidy Evaluation

Common Final

  • Monday, June 8th, 7-10pm
  • Building 180 Room 102
  • Email me today if you have a conflict

Writing Data Frame Functions

Moving Beyond Vectors

This week, we’re writing functions that take a data frame and variable names as arguments.

These functions can be incredibly powerful, but they require us to learn some interesting details about how some of the functions we’ve grown very accustomed to (e.g., select(), mutate(), group_by()) work “behind the scenes.”

Function to Standardize Data

We want to take in a vector of numbers and standardize it. One form of standardization is ensuring that the mean is 0 and standard deviation is 1.

std_vec <- function(var) {
  stopifnot(is.numeric(var))
  
  (var - mean(var, na.rm = TRUE)) / sd(var, na.rm = TRUE)
}
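A quick check can confirm that the function does what we want. This is a minimal sketch on a made-up vector (`x` is hypothetical, not course data); `std_vec()` is repeated so the example runs on its own.

```r
# std_vec() as defined above, repeated so this sketch is self-contained.
std_vec <- function(var) {
  stopifnot(is.numeric(var))

  (var - mean(var, na.rm = TRUE)) / sd(var, na.rm = TRUE)
}

# Hypothetical toy vector with one missing value.
x <- c(2, 4, 6, NA, 8)
z <- std_vec(x)

mean(z, na.rm = TRUE)  # 0, up to floating-point rounding
sd(z, na.rm = TRUE)    # 1, up to floating-point rounding
is.na(z[4])            # the missing value stays missing
```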

Standardizing Variables

Is it a good idea to standardize (scale) variables in a data analysis?

Why standardize?

  • Easier to compare across variables.
  • Easier to model – standardizes the amount of variability.

Why not standardize?

  • More difficult to interpret the values.

Pair Our Function with dplyr

Let’s standardize penguin measurements.

penguins |> 
  mutate(bill_length_mm    = std_vec(bill_length_mm), 
         bill_depth_mm     = std_vec(bill_depth_mm), 
         flipper_length_mm = std_vec(flipper_length_mm), 
         body_mass_g       = std_vec(body_mass_g))

That’s nice, but our function only works on a data frame when we wrap it in mutate()…

Write the function like a dplyr function?

std_column <- function(data, var) {
  stopifnot(is.data.frame(data))
  
  data |> 
    mutate(var = std_vec(var))
}

Note

I used the existing function std_vec() inside the new function for clarity!

But it didn’t work…

penguins |> 
  std_column(var = body_mass_g)
Error in `mutate()`:
ℹ In argument: `var = std_vec(var)`.
Caused by error:
! object 'body_mass_g' not found

Tidy Evaluation

Functions using unquoted variable names as arguments are said to use nonstandard evaluation or tidy evaluation.

Tidy:

penguins |> 
  pull(body_mass_g)

  OR

penguins$body_mass_g

Untidy:

penguins[, "body_mass_g"]

  OR

penguins[["body_mass_g"]]


Tidy evaluation does not work automatically inside functions you write yourself.
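To see the problem in isolation, here is a small sketch using a made-up data frame. The names `toy`, `mass`, and `col_mean_naive()` are hypothetical, and the example assumes dplyr is installed and that no object called `mass` exists in your workspace.

```r
library(dplyr)

# Hypothetical toy data frame; `mass` exists only as a column.
toy <- tibble(mass = c(10, 20, 30))

# Works: summarize() finds `mass` inside the data frame.
toy |>
  summarize(avg = mean(mass))

# A naive wrapper: here `var` is an ordinary function argument.
col_mean_naive <- function(data, var) {
  data |>
    summarize(avg = mean(var))
}

# Fails: R ends up looking for an object called `mass` outside the data frame.
# tryCatch() is used only so the sketch keeps running after the error.
outcome <- tryCatch(
  col_mean_naive(toy, mass),
  error = function(e) "error"
)
outcome
```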

Solution 1

Don’t use tidy evaluation in your own functions.

  • This is more complicated to read and use, but it’s safe.

std_column <- function(data, variable) {
  stopifnot(is.data.frame(data))
  
  data[[variable]] <- std_vec(data[[variable]])
  
  return(data)
}

std_column(penguins, "bill_length_mm")
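One upside of passing column names as strings is that they are ordinary values, so they can be stored in a vector and looped over. A minimal sketch, assuming a made-up data frame (`toy` and its columns are hypothetical); both functions are repeated so the example runs on its own.

```r
std_vec <- function(var) {
  stopifnot(is.numeric(var))

  (var - mean(var, na.rm = TRUE)) / sd(var, na.rm = TRUE)
}

std_column <- function(data, variable) {
  stopifnot(is.data.frame(data))

  data[[variable]] <- std_vec(data[[variable]])

  return(data)
}

# Hypothetical toy data: two numeric columns and one character column.
toy <- data.frame(
  height = c(150, 160, 170),
  weight = c(50, 65, 80),
  id     = c("a", "b", "c")
)

# Standardize each named column in turn; `id` is left untouched.
for (col in c("height", "weight")) {
  toy <- std_column(toy, col)
}

toy
```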

Solution 2: rlang

Use the rlang package!

  • This package provides operators that simplify writing functions around tidyverse pipelines.

  • Note: you do not need to load rlang
  • Read more about using this package for function writing here!

Indirection

The tidyverse functions use either “tidy selection” or “data masking.” Both of these features make common tasks easier at the cost of making less common tasks harder.

Clarifying our language

Data masking blurs the line between the two different meanings of the word “variable”:

  • env-variables – “programming” variables that live in an environment.
    • These are typically created using <-.
  • data-variables – “statistical” variables that live in a data frame.
    • These come from data files or are created manipulating existing variables.
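The two meanings can collide when a column and a regular object share a name. The sketch below uses a made-up data frame (`toy` and both `x`s are hypothetical) and the `.data` and `.env` pronouns, which are available inside dplyr's data-masking verbs, to spell out which meaning is intended.

```r
library(dplyr)

# `x` inside the tibble is a data-variable (a column).
toy <- tibble(x = c(1, 2, 3))

# This `x` is an env-variable, created with <-.
x <- 100

# Inside mutate(), a bare `x` refers to the column first.
masked <- toy |>
  mutate(y = x * 2)

# The pronouns make the choice explicit: column x plus the object x.
explicit <- toy |>
  mutate(y = .data$x + .env$x)

masked
explicit
```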

Embrace Operator

In the case of our function, the name of the column we want to use is stored in an intermediate variable (e.g., var = bill_length_mm).

When you have a data-variable in a function argument, you need to embrace the argument by wrapping it in {{ }}.

summarize(mean({{ var }}))
select({{ var }})
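Put inside a function, embracing might look like the following sketch. The helper `group_mean()` and the data frame `toy` are hypothetical, made up for illustration; both arguments hold data-variables, so both are embraced.

```r
library(dplyr)

# Hypothetical toy data.
toy <- tibble(
  group = c("a", "a", "b"),
  mass  = c(10, 20, 60)
)

# `grp` and `var` are passed as unquoted column names, so each is embraced.
group_mean <- function(data, grp, var) {
  data |>
    group_by({{ grp }}) |>
    summarize(avg = mean({{ var }}, na.rm = TRUE))
}

result <- group_mean(toy, grp = group, var = mass)
result
```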

Walrus Operator

If you want to create a data-variable whose name is a user-provided function argument, you need to use the walrus operator (:=).

mutate({{ var }} := {{ var2 }} / 3)
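As a sketch of the walrus operator inside a function, the hypothetical helper below creates a new column whose name comes from the caller. The names `to_thirds()`, `toy`, `feet`, and `yards` are invented for this example.

```r
library(dplyr)

# Hypothetical toy data.
toy <- tibble(feet = c(3, 6, 9))

# `new` names the column to create; `old` is the existing column to divide.
to_thirds <- function(data, new, old) {
  data |>
    mutate({{ new }} := {{ old }} / 3)
}

result <- to_thirds(toy, new = yards, old = feet)
result
```

A regular `=` would not work here, because `mutate({{ new }} = ...)` is not valid R syntax.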

Recall Our Broken Function

std_column <- function(data, var) {
  stopifnot(is.data.frame(data))
  
  data |> 
    mutate(var = std_vec(var))
}

penguins |> 
  std_column(body_mass_g)
Error in `mutate()`:
ℹ In argument: `var = std_vec(var)`.
Caused by error:
! object 'body_mass_g' not found

  • Inside mutate(), var is treated as an env-variable, so R looks for an object named body_mass_g outside the data frame and does not find one.
  • We need to embrace var so that mutate() knows to look for body_mass_g as a data-variable.

Fixing Our Broken Function

Use the embrace operator together with the walrus operator:

std_column <- function(data, var) {
  stopifnot(is.data.frame(data))
  
  data |> 
    mutate({{ var }} := std_vec({{ var }}))
}

penguins |> 
  std_column(body_mass_g) |> 
  slice_head(n = 5)
# A tibble: 5 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <dbl>
1 Adelie  Torgersen           39.1          18.7               181      -0.563
2 Adelie  Torgersen           39.5          17.4               186      -0.501
3 Adelie  Torgersen           40.3          18                 195      -1.19 
4 Adelie  Torgersen           NA            NA                  NA      NA    
5 Adelie  Torgersen           36.7          19.3               193      -0.937
# ℹ 2 more variables: sex <fct>, year <int>

Now you can create your own tidy functions!
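One possible next step, sketched here as an assumption rather than as part of the course code, is to standardize several columns at once by embracing a tidy selection inside across(). The function `std_columns()` and the data frame `toy` are hypothetical.

```r
library(dplyr)

std_vec <- function(var) {
  stopifnot(is.numeric(var))

  (var - mean(var, na.rm = TRUE)) / sd(var, na.rm = TRUE)
}

# `vars` accepts any tidy selection, for example c(height, weight).
std_columns <- function(data, vars) {
  stopifnot(is.data.frame(data))

  data |>
    mutate(across({{ vars }}, std_vec))
}

# Hypothetical toy data.
toy <- tibble(
  height = c(150, 160, 170),
  weight = c(50, 65, 80),
  id     = c("a", "b", "c")
)

result <- std_columns(toy, c(height, weight))
result
```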

PA 8.1: Data Frame Functions

You will write a tidy function for a contingency table like the table() function in base R.

Illustration by Allison Horst: a cute fuzzy monster holding a briefcase labeled “tidy data” stands beside a friendly data table character, welcomed with cheers by many other data tables and another cute monster jumping with joy.

Collaborative Protocol

During your collaboration, your group will alternate between three roles:

  • Project Manager
    • Reads out the prompt and ensures the group understands what is being asked.
    • Manages resources (e.g., cheatsheets, textbook).
    • Answers Coder’s questions about syntax based on the resources.
    • Works with the group to debug the code.
    • Encourages the Coder to vocalize and explain their thinking.
  • Computer
    • Types the code specified by the Coder into the Quarto document.
    • Runs the code provided by the Coder.
    • Works with the group to debug the code.
    • Evaluates the output against the question prompt.
  • Coder
    • Confirms they understand what the prompt is asking.
    • Talks with the group about their ideas.
    • Explains their thinking.
    • Directs the Computer what to type.
    • Works with the group to debug the code.

Submission

  • Instead of a simple puzzle answer, you will submit a screenshot of your function code.

Starting Roles Today

The person who lives closest to campus starts as the Coder; the second closest starts as the Project Manager.

To do…

  • PA 8.1: Data Frame Functions
    • Due Tuesday 5/19 at 11:59pm.
  • Project Checkpoint 3: Progress Report
    • Due Friday 5/22 at 11:59pm.
  • Lab 8: Data Frame Functions + Simulation
    • Due Tuesday 5/26 at 11:59pm.