Writing Functions

Thursday, May 15

Today we will…

  • New Material
    • Calling Functions on Datasets
    • rlang Tidy Evaluation
    • Missing Data
  • Lab 7: Functions + Fish

Calling Functions on Datasets

Pair Our Function with dplyr

Consider the penguins Data

library(dplyr)           # for mutate() and across() used below
library(palmerpenguins)
data(penguins)
penguins |> 
  head()
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

Function to Standardize Data

We want to take in a vector of numbers and standardize it: rescale the values so they all fall between 0 and 1.

std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  
  return(num / denom)
}
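
As a quick check of the function above, here is a minimal sketch on a made-up vector (the vector x is hypothetical, not taken from the penguins data). Missing values stay missing, because na.rm = TRUE only affects the calls to min() and max().

```r
std_to_01 <- function(var) {
  stopifnot(is.numeric(var))

  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)

  return(num / denom)
}

# Hypothetical example vector with one missing value
x <- c(2, 4, NA, 10)

std_x <- std_to_01(x)
std_x
# The smallest value maps to 0, the largest to 1, and NA stays NA:
# 0.00 0.25 NA 1.00

# Non-numeric input is rejected by stopifnot(); catch the error to show it
rejected <- tryCatch(std_to_01(c("a", "b")),
                     error = function(e) "input was not numeric")
rejected
```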

Standardizing Variables

Is it a good idea to standardize (scale) variables in a data analysis?

Why standardize?

  • Easier to compare across variables.
  • Easier to model – standardizes the amount of variability.

Why not standardize?

  • More difficult to interpret the values.

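E.g., a penguin with a bill length of 35 mm (standardized to 0.11) and a body mass of 5500 g (standardized to 0.78). The standardized values are hard to interpret without converting back to the original units.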
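
To see where those standardized values come from, the arithmetic can be sketched as follows. The minimum and maximum values used here (32.1 and 59.6 mm for bill length, 2700 and 6300 g for body mass) are assumptions: they are not stated in these slides, but they are consistent with the standardized values the slides report.

```r
# Assumed observed ranges (not stated in the slides)
bill_min <- 32.1
bill_max <- 59.6   # mm
mass_min <- 2700
mass_max <- 6300   # g

# Same formula as std_to_01(): (value - min) / (max - min)
bill_std <- (35 - bill_min) / (bill_max - bill_min)
mass_std <- (5500 - mass_min) / (mass_max - mass_min)

round(bill_std, 2)  # 0.11
round(mass_std, 2)  # 0.78

# Reversing the transformation recovers the original units
# (35 mm, up to floating-point error)
bill_back <- bill_std * (bill_max - bill_min) + bill_min
bill_back
```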

Pair Our Function with dplyr

Let’s standardize penguin measurements.

penguins |> 
  mutate(bill_length_mm    = std_to_01(bill_length_mm), 
         bill_depth_mm     = std_to_01(bill_depth_mm), 
         flipper_length_mm = std_to_01(flipper_length_mm), 
         body_mass_g       = std_to_01(body_mass_g))
  • Ugh. Still copy-pasting!

Recall across()!

penguins |> 
  mutate(across(.cols = bill_length_mm:body_mass_g,
                .fns = ~ std_to_01(.x))) |> 
  slice_head(n = 4)
# A tibble: 4 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torgersen          0.255         0.667             0.153       0.292
2 Adelie  Torgersen          0.269         0.512             0.237       0.306
3 Adelie  Torgersen          0.298         0.583             0.390       0.153
4 Adelie  Torgersen         NA            NA                NA          NA    
# ℹ 2 more variables: sex <fct>, year <int>
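
If the original measurements should be kept next to the standardized ones, across() has a .names argument for naming the new columns. A minimal sketch on a small made-up data frame (toy and its columns are hypothetical):

```r
library(dplyr)

std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  return(num / denom)
}

# Hypothetical data, standing in for the penguin measurements
toy <- tibble(
  id   = c("a", "b", "c"),
  bill = c(10, 20, 30),
  mass = c(100, 150, 300)
)

# .names builds each new column name from the original one,
# so bill and mass are kept and bill_std and mass_std are added
toy_std <- toy |>
  mutate(across(.cols = bill:mass,
                .fns = std_to_01,
                .names = "{.col}_std"))

names(toy_std)
# "id" "bill" "mass" "bill_std" "mass_std"
```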

Use variables as function arguments?

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  
  data <- data |> 
    mutate(variable = std_to_01(variable))
  return(data)
}

Note

I used the existing function std_to_01() inside the new function for clarity!

But it didn’t work…

std_column_01(penguins, body_mass_g)
Error in `mutate()`:
ℹ In argument: `variable = std_to_01(variable)`.
Caused by error:
! object 'body_mass_g' not found

Tidy Evaluation

Functions using unquoted variable names as arguments are said to use nonstandard evaluation or tidy evaluation.

Tidy:

penguins |> 
  pull(body_mass_g)

  OR

penguins$body_mass_g

Untidy:

penguins[, "body_mass_g"]

  OR

penguins[["body_mass_g"]]


Tidy evaluation isn’t naturally supported when writing your own functions.

Defused R Code

When a piece of code is defused, R doesn’t return its value like normal.

  • Instead it returns an expression that describes how to evaluate it.

Evaluated code:

1 + 1
[1] 2

Defused code:

expr(1 + 1)
1 + 1

Tidyverse functions defuse the unquoted column names we give them, but functions we write ourselves don’t know how to handle defused code unless we tell them how.
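
A defused expression can be stored and evaluated later. The sketch below uses expr() and eval_tidy() from rlang; the object names (defused, toy, col_mean) are made up for illustration.

```r
library(rlang)

# Defuse: capture the code instead of running it
defused <- expr(1 + 1)
defused
# 1 + 1

# Evaluate the captured code later
two <- eval_tidy(defused)
two
# 2

# A defused expression can mention a variable that only exists
# as a column; supplying the data at evaluation time resolves it
toy <- data.frame(x = c(1, 2, 3))
col_mean <- expr(mean(x))
mean_x <- eval_tidy(col_mean, data = toy)
mean_x
# 2
```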

Solution 1

Don’t use tidy evaluation in your own functions.

  • This is more complicated to read and use, but it’s safe.
std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  
  data[[variable]] <- std_to_01(data[[variable]])
  return(data)
}

std_column_01(penguins, "bill_length_mm")
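
One practical upside of passing column names as strings is that they are easy to loop over. A minimal sketch using only base R and a made-up data frame (toy and its column names are hypothetical):

```r
std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  return(num / denom)
}

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))

  data[[variable]] <- std_to_01(data[[variable]])
  return(data)
}

# Hypothetical data
toy <- data.frame(bill = c(10, 20, 30),
                  mass = c(100, 150, 300))

# Column names are ordinary strings, so a plain for loop works
for (col in c("bill", "mass")) {
  toy <- std_column_01(toy, col)
}

toy
# bill is now 0, 0.5, 1 and mass is now 0, 0.25, 1
```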

Solution 2: rlang

Use the rlang package!

  • This package provides operators that simplify writing functions around tidyverse pipelines.

  • Read more about using this package for function writing here!

Solution 2: rlang

Two ways to get around the issue of defused code:

  1. Embrace Operator ({{ }})
    • With {{ }}, you can transport a variable from one function to another.
  2. Defuse and Inject
    • You can first use enquo(arg) to defuse the variable.
    • Then use !!arg to inject the variable.

Note

I am going to just focus on using the embrace operator for the rest of class, but know that defuse/inject is another option!
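
For reference, the defuse-and-inject pattern might look like the sketch below. The function name std_column_01_inject() and the toy data are made up for illustration; enquo() is used as re-exported by dplyr.

```r
library(dplyr)

std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  return(num / denom)
}

std_column_01_inject <- function(data, variable) {
  stopifnot(is.data.frame(data))

  # Defuse: capture the column the user typed, without evaluating it
  variable <- enquo(variable)

  # Inject: put the captured column back into the mutate() call
  data |>
    mutate(!!variable := std_to_01(!!variable))
}

# Hypothetical data
toy <- tibble(mass = c(100, 150, 300))

result <- std_column_01_inject(toy, mass)
result
# mass is now 0, 0.25, 1
```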

Solution 2: rlang

If we use either of these solutions to name a column, we also need to use the walrus operator (:=).

  • This means we have to use := instead of = whenever the embraced or injected variable sits on the left-hand side, as the name of the column being created or modified.
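
In practice, := matters when the column name itself comes from the function argument. A minimal sketch with made-up function names (mean_of(), double_column()) and toy data: = still works when the new column's name is typed out literally, while := is needed once the name on the left is the embraced argument.

```r
library(dplyr)

# The name on the left is typed literally ("avg"), so = is fine
mean_of <- function(data, variable) {
  data |>
    summarise(avg = mean({{ variable }}, na.rm = TRUE))
}

# The name on the left is the embraced argument, so := is required
double_column <- function(data, variable) {
  data |>
    mutate({{ variable }} := 2 * {{ variable }})
}

# Hypothetical data
toy <- tibble(mass = c(100, 150, NA, 350))

avg_result <- mean_of(toy, mass)
avg_result
# one row with avg = 200

doubled <- double_column(toy, mass)
doubled
# mass is now 200, 300, NA, 700
```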

Recall Our Broken Function

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  
  data <- data |> 
    mutate(variable = std_to_01(variable))
  return(data)
}

std_column_01(penguins, body_mass_g) |> 
  slice_head(n = 5)
Error in `mutate()`:
ℹ In argument: `variable = std_to_01(variable)`.
Caused by error:
! object 'body_mass_g' not found
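  • mutate() looks for a column named variable, does not find one, and falls back to the function argument. R then tries to evaluate body_mass_g outside of the data, where it does not exist.
  • We need to modify how variable is passed to mutate() to make this work correctly!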

Fixing Our Broken Function

Use the Embrace Operator:

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))

  data <- data |>
    mutate({{variable}} := std_to_01({{variable}}))
  return(data)
}

std_column_01(penguins, body_mass_g) |> 
  slice_head(n = 5)
# A tibble: 5 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <dbl>
1 Adelie  Torgersen           39.1          18.7               181       0.292
2 Adelie  Torgersen           39.5          17.4               186       0.306
3 Adelie  Torgersen           40.3          18                 195       0.153
4 Adelie  Torgersen           NA            NA                  NA      NA    
5 Adelie  Torgersen           36.7          19.3               193       0.208
# ℹ 2 more variables: sex <fct>, year <int>
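
A related variation keeps the original column and writes the standardized values to a new one. This sketch relies on the glue-style names that can be used on the left of := in dplyr verbs; the function name add_std_column() and the _std suffix are made up for illustration.

```r
library(dplyr)

std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  return(num / denom)
}

add_std_column <- function(data, variable) {
  stopifnot(is.data.frame(data))

  # "{{ variable }}_std" builds the new name from the embraced column,
  # e.g. mass becomes mass_std
  data |>
    mutate("{{ variable }}_std" := std_to_01({{ variable }}))
}

# Hypothetical data
toy <- tibble(mass = c(100, 150, 300))

with_std <- add_std_column(toy, mass)
names(with_std)
# "mass" "mass_std"
```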

Transport Multiple Variables

What if I want to modify multiple columns?

  • Use across()!
std_column_01 <- function(data, variables) {
  stopifnot(is.data.frame(data))
  
  data <- data |> 
    mutate(across(.cols = {{variables}},
                  .fns = ~ std_to_01(.x)))
  return(data)
}

std_column_01(penguins, bill_length_mm:body_mass_g) |> 
  slice_head(n = 5)
# A tibble: 5 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torgersen          0.255         0.667             0.153       0.292
2 Adelie  Torgersen          0.269         0.512             0.237       0.306
3 Adelie  Torgersen          0.298         0.583             0.390       0.153
4 Adelie  Torgersen         NA            NA                NA          NA    
5 Adelie  Torgersen          0.167         0.738             0.356       0.208
# ℹ 2 more variables: sex <fct>, year <int>
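
Because the embraced argument is handed straight to across(), the caller is not limited to a range like bill_length_mm:body_mass_g. A minimal sketch on made-up data (toy and its columns are hypothetical), using c() and the starts_with() selection helper:

```r
library(dplyr)

std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  return(num / denom)
}

std_column_01 <- function(data, variables) {
  stopifnot(is.data.frame(data))

  data <- data |>
    mutate(across(.cols = {{ variables }},
                  .fns = ~ std_to_01(.x)))
  return(data)
}

# Hypothetical data
toy <- tibble(
  species  = c("a", "b", "c"),
  bill_len = c(10, 20, 30),
  bill_dep = c(1, 2, 5),
  mass     = c(100, 150, 300)
)

# Pick specific columns with c(); bill_dep is left untouched
picked <- std_column_01(toy, c(bill_len, mass))

# Pick columns by name pattern; mass is left untouched
by_prefix <- std_column_01(toy, starts_with("bill"))

picked
by_prefix
```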

Missing Data

Types of Missing Data

  1. Missing Completely at Random (MCAR)
    • No difference between missing and observed values.
    • Missing observations are a random subset of all observations.
  2. Missing at Random (MAR)
    • Systematic difference between missing and observed values, but can be entirely explained by other observed variables.
  3. Missing Not at Random (MNAR)
    • Missingness is directly related to the unobserved value.

Types of Missing Data

Consider a study of depression.

  1. Missing Completely at Random (MCAR)
    • Some subjects have missing lab values because a batch of samples was processed improperly.
  2. Missing at Random (MAR)
    • Subjects who identify as men are less likely to complete a survey on depression severity.
  3. Missing Not at Random (MNAR)
    • Subjects with more severe depression are less likely to complete a survey on depression severity.
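
To make the three mechanisms concrete, here is a minimal simulation sketch in base R. Everything in it is made up for illustration: the variable names, the sample size, the 0 to 27 severity scale, and the missingness probabilities are assumptions, not values from a real depression study.

```r
set.seed(7)
n <- 1000

# Hypothetical study: gender indicator and a depression severity score
study <- data.frame(
  man      = rbinom(n, size = 1, prob = 0.5) == 1,
  severity = round(runif(n, min = 0, max = 27))
)

# MCAR: every subject has the same 20% chance of a missing score
miss_mcar <- runif(n) < 0.20

# MAR: missingness depends only on an observed variable (gender)
miss_mar <- runif(n) < ifelse(study$man, 0.35, 0.05)

# MNAR: missingness depends on the unobserved score itself
miss_mnar <- runif(n) < 0.40 * study$severity / 27

# Mean severity among the scores that remain, next to the true mean.
# Under MNAR the remaining scores under-represent severe cases.
observed_means <- c(
  truth = mean(study$severity),
  mcar  = mean(study$severity[!miss_mcar]),
  mar   = mean(study$severity[!miss_mar]),
  mnar  = mean(study$severity[!miss_mnar])
)
observed_means

# Under MAR, men are under-represented among complete responses
prop_men_all      <- mean(study$man)
prop_men_observed <- mean(study$man[!miss_mar])
c(all = prop_men_all, observed = prop_men_observed)
```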

When we remove missing data…

We implicitly assume observations are missing completely at random!

  • We might be mostly removing data from subjects who identify as men.
  • We might be mostly removing data from subjects with severe depression.
  • We are inadvertently making our data less representative.

We need to take more care when dealing with missing values!

Dealing with Missing Data

  • Look for patterns!
    • Do observations with missing values have similar traits?
  • Consider outside explanations!
    • Why might missing data exist?
    • Should we have a “missing” category in our analysis?
  • Can we impute values?
    • If depression is MCAR within gender, age, and education level, then the distribution of depression will be similar for people of the same gender, age, and education level.
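
The first and last ideas above can be sketched in a few lines of dplyr. The data frame, its columns, and the choice of a group mean as the imputed value are all assumptions made for illustration. A group mean is only one simple option, and it is reasonable only if values are MCAR within each group.

```r
library(dplyr)

# Hypothetical survey data
survey <- tibble(
  gender = c("man", "man", "man", "woman", "woman", "woman"),
  score  = c(10, NA, 14, 6, 8, NA)
)

# Look for patterns: how much is missing in each group?
missing_by_group <- survey |>
  group_by(gender) |>
  summarise(n_missing    = sum(is.na(score)),
            prop_missing = mean(is.na(score)))
missing_by_group

# Impute: replace each missing score with the mean of its own group
imputed <- survey |>
  group_by(gender) |>
  mutate(score = if_else(is.na(score),
                         mean(score, na.rm = TRUE),
                         score)) |>
  ungroup()

imputed$score
# 10 12 14 6 8 7
```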

Lab 7: Functions + Fish

To do…

  • Final Project Group Contract
    • Due tomorrow, Friday 5/16, at 11:59pm.
  • Lab 7: Functions + Fish
    • Due Monday 5/19 at 11:59pm.
  • Read Chapter 8: Iteration and Simulation
    • Check-in 8.1 & 8.2 due Tuesday 5/20 before class.