Writing Functions

Thursday, May 15

Today we will…

New Material
- Calling Functions on Datasets
- rlang Tidy Evaluation
- Missing Data
Lab 7: Functions + Fish

Calling Functions on Datasets

Pair Our Function with `dplyr`

Consider the penguins Data

library(dplyr)           # mutate(), across(), slice_head(), pull()
library(palmerpenguins)  # penguins data
data(penguins)
penguins |> 
  head()

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

Function to Standardize Data

We want to take in a numeric vector and rescale it so that every value falls between 0 and 1 (min-max scaling).

std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  
  return(num / denom)
}
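
As a quick check, the function can be tried on a small vector. The values below are made up for illustration and are not from the penguins data.

```r
# Same function as on the slide, repeated so this example runs on its own
std_to_01 <- function(var) {
  stopifnot(is.numeric(var))

  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)

  return(num / denom)
}

# Hypothetical measurements, including one missing value
x <- c(2, 4, NA, 10)

# The minimum maps to 0, the maximum maps to 1, and the NA stays NA
scaled <- std_to_01(x)
scaled

# Non-numeric input is stopped by the stopifnot() check
not_numeric <- try(std_to_01(c("a", "b")), silent = TRUE)
inherits(not_numeric, "try-error")
```

Note that a vector whose values are all equal would give 0 / 0, which R reports as NaN.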

Standardizing Variables

Is it a good idea to standardize (scale) variables in a data analysis?

Why standardize?

Easier to compare across variables.
Easier to model – standardizes the amount of variability.

Why not standardize?

More difficult to interpret the values.

E.g., a penguin with a bill length of 35 mm (std to 0.11) and a mass of 5500 g (std to 0.78).
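
A standardized value can be mapped back to the original units if the minimum and maximum are kept. This is a minimal sketch with made-up masses, not the penguins data.

```r
# Hypothetical body masses in grams (made up for illustration)
mass_g <- c(2700, 3750, 5500, 6300)

lo <- min(mass_g, na.rm = TRUE)
hi <- max(mass_g, na.rm = TRUE)

# Standardize to the 0-1 scale
mass_std <- (mass_g - lo) / (hi - lo)
round(mass_std, 2)

# Undo the scaling: original = min + standardized * (max - min)
mass_back <- lo + mass_std * (hi - lo)
all.equal(mass_back, mass_g)
```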

Pair Our Function with `dplyr`

Let’s standardize penguin measurements.

penguins |> 
  mutate(bill_length_mm    = std_to_01(bill_length_mm), 
         bill_depth_mm     = std_to_01(bill_depth_mm), 
         flipper_length_mm = std_to_01(flipper_length_mm), 
         body_mass_g       = std_to_01(body_mass_g))

Ugh. Still copy-pasting!

Recall across()!

penguins |> 
  mutate(across(.cols = bill_length_mm:body_mass_g,
                .fns = ~ std_to_01(.x))) |> 
  slice_head(n = 4)

# A tibble: 4 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torgersen          0.255         0.667             0.153       0.292
2 Adelie  Torgersen          0.269         0.512             0.237       0.306
3 Adelie  Torgersen          0.298         0.583             0.390       0.153
4 Adelie  Torgersen         NA            NA                NA          NA    
# ℹ 2 more variables: sex <fct>, year <int>
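
If overwriting the raw measurements makes the values harder to interpret, one option is to keep both versions. The sketch below uses the `.names` argument of across() on a small made-up tibble; the `birds` data and the `_std` suffix are assumptions for illustration.

```r
library(dplyr)

std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  return(num / denom)
}

# Hypothetical stand-in for the penguins data
birds <- tibble(
  id             = 1:4,
  bill_length_mm = c(35, 40, NA, 55),
  body_mass_g    = c(3000, 3500, 4000, 5000)
)

# .names controls the output column names, so the originals are kept
birds_std <- birds |>
  mutate(across(.cols  = bill_length_mm:body_mass_g,
                .fns   = std_to_01,
                .names = "{.col}_std"))

names(birds_std)
```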

Use variables as function arguments?

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  
  data <- data |> 
    mutate(variable = std_to_01(variable))
  return(data)
}

Note

I used the existing function std_to_01() inside the new function for clarity!

But it didn’t work…

std_column_01(penguins, body_mass_g)

Error in `mutate()`:
ℹ In argument: `variable = std_to_01(variable)`.
Caused by error:
! object 'body_mass_g' not found

Tidy Evaluation

Functions using unquoted variable names as arguments are said to use nonstandard evaluation or tidy evaluation.

Tidy:

penguins |> 
  pull(body_mass_g)

penguins$body_mass_g

Untidy:

penguins[, "body_mass_g"]

penguins[["body_mass_g"]]

Tidy evaluation isn’t naturally supported when writing your own functions.

Defused R Code

When a piece of code is defused, R doesn’t return its value like normal.

Instead it returns an expression that describes how to evaluate it.

Evaluated code:

1 + 1

[1] 2

Defused code:

expr(1 + 1)

1 + 1

We produce defused code when we use tidy evaluation and our own functions don’t know how to handle it.
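
To see the difference between a value and a defused expression, here is a small sketch. The helper `show_input()` is hypothetical and exists only for this illustration.

```r
library(rlang)

# Defused: we get the expression itself, not its value
defused <- expr(1 + 1)
defused

# A defused expression can still be evaluated later, on demand
eval(defused)

# Inside a function, enexpr() defuses whatever the caller typed
show_input <- function(x) {
  enexpr(x)
}

# Returns the symbol body_mass_g; no object of that name is looked up
captured <- show_input(body_mass_g)
captured
```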

Solution 1

Don’t use tidy evaluation in your own functions.

This is more complicated to read and use, but it’s safe.

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  
  data[[variable]] <- std_to_01(data[[variable]])
  return(data)
}

std_column_01(penguins, "bill_length_mm")
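
One upside of passing column names as strings is that they are ordinary values, so they can be stored in a vector and looped over. A minimal sketch on a made-up data frame (`toy` is hypothetical):

```r
std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  return(num / denom)
}

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))

  data[[variable]] <- std_to_01(data[[variable]])
  return(data)
}

# Hypothetical toy data
toy <- data.frame(a = c(1, 2, 3), b = c(10, 30, 20), label = c("x", "y", "z"))

# Standardize each named column in turn; label is left alone
for (col in c("a", "b")) {
  toy <- std_column_01(toy, col)
}
toy
```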

Solution 2: `rlang`

Use the rlang package!

This package provides operators that simplify writing functions around tidyverse pipelines.

Read more about using this package for function writing here!

Solution 2: `rlang`

Two ways to get around the issue of defused code:

Embrace Operator ({{ }})

With {{ }}, you can transport a variable from one function to another.

Defuse and Inject

You can first use enquo(arg) to defuse the variable.
Then use !!arg to inject the variable.

Note

I am going to focus on the embrace operator for the rest of class, but know that defuse/inject is another option!
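
For reference, here is roughly what the defuse-and-inject route could look like. This is a sketch, not the version used in class; the function name `std_column_01_inject()` and the `toy` data are made up. Because the injected name sits on the left-hand side, the assignment uses := rather than =.

```r
library(dplyr)

std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  return(num / denom)
}

std_column_01_inject <- function(data, variable) {
  stopifnot(is.data.frame(data))

  # Defuse: capture the column the caller typed, without evaluating it
  variable <- enquo(variable)

  # Inject: put the captured column back into the mutate() call
  data |>
    mutate(!!variable := std_to_01(!!variable))
}

# Hypothetical toy data
toy <- tibble(id = 1:3, mass = c(3000, 3500, 5000))
result <- std_column_01_inject(toy, mass)
result
```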

Solution 2: `rlang`

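If we use either of these solutions to name a column, we also need to use the walrus operator (:=).

R does not allow {{ }} or !! on the left-hand side of =, so we have to use := instead of = whenever the embraced or injected variable names the column being created or modified. On the right-hand side, or in arguments such as .cols, = works as usual.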
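
The reason for := is that R itself refuses to parse an embraced name to the left of =. The sketch below checks this directly; `add_flag()` and `toy` are hypothetical names used only here.

```r
library(dplyr)

# Inside a function call, only a plain name or a string may sit to the left of `=`
bad_code <- "mutate(df, {{ new_col }} = 1)"
parse_result <- try(parse(text = bad_code), silent = TRUE)
inherits(parse_result, "try-error")   # TRUE: this is a syntax error

# With `:=` the line parses, and dplyr works out the column name
add_flag <- function(data, new_col) {
  data |>
    mutate({{ new_col }} := 1)
}

toy <- tibble(id = 1:3)
flagged <- add_flag(toy, flag)
flagged
```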

Recall Our Broken Function

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  
  data <- data |> 
    mutate(variable = std_to_01(variable))
  return(data)
}

std_column_01(penguins, body_mass_g) |> 
  slice_head(n = 5)

Error in `mutate()`:
ℹ In argument: `variable = std_to_01(variable)`.
Caused by error:
! object 'body_mass_g' not found

The code is defused, so mutate() doesn’t know what body_mass_g is.
We need to modify variable to make this work correctly!

Fixing Our Broken Function

Use the Embrace Operator:

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))

  data <- data |>
    mutate({{variable}} := std_to_01({{variable}}))
  return(data)
}

std_column_01(penguins, body_mass_g) |> 
  slice_head(n = 5)

# A tibble: 5 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <dbl>
1 Adelie  Torgersen           39.1          18.7               181       0.292
2 Adelie  Torgersen           39.5          17.4               186       0.306
3 Adelie  Torgersen           40.3          18                 195       0.153
4 Adelie  Torgersen           NA            NA                  NA      NA    
5 Adelie  Torgersen           36.7          19.3               193       0.208
# ℹ 2 more variables: sex <fct>, year <int>
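
The embrace operator works the same way in other dplyr verbs. A minimal sketch with a hypothetical helper `group_mean()` on made-up data; here the output column has a fixed name, so nothing is embraced to the left of = and := is not needed.

```r
library(dplyr)

group_mean <- function(data, group, variable) {
  stopifnot(is.data.frame(data))

  data |>
    group_by({{ group }}) |>
    summarise(mean_value = mean({{ variable }}, na.rm = TRUE))
}

# Hypothetical toy data
toy <- tibble(
  species = c("A", "A", "B", "B"),
  mass    = c(3000, 4000, NA, 5000)
)

means <- group_mean(toy, species, mass)
means
```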

Transport Multiple Variables

What if I want to modify multiple columns?

Use across()!

std_column_01 <- function(data, variables) {
  stopifnot(is.data.frame(data))
  
  data <- data |> 
    mutate(across(.cols = {{variables}},
                  .fns = ~ std_to_01(.x)))
  return(data)
}

std_column_01(penguins, bill_length_mm:body_mass_g) |> 
  slice_head(n = 5)

# A tibble: 5 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torgersen          0.255         0.667             0.153       0.292
2 Adelie  Torgersen          0.269         0.512             0.237       0.306
3 Adelie  Torgersen          0.298         0.583             0.390       0.153
4 Adelie  Torgersen         NA            NA                NA          NA    
5 Adelie  Torgersen          0.167         0.738             0.356       0.208
# ℹ 2 more variables: sex <fct>, year <int>
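
Because `variables` is handed straight to `.cols`, the function should accept anything across() accepts, not just a range like `a:b`. A sketch on a made-up tibble (`toy` is hypothetical):

```r
library(dplyr)

std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  return(num / denom)
}

std_column_01 <- function(data, variables) {
  stopifnot(is.data.frame(data))

  data <- data |>
    mutate(across(.cols = {{ variables }},
                  .fns = ~ std_to_01(.x)))
  return(data)
}

# Hypothetical toy data
toy <- tibble(
  id     = c("a", "b", "c"),
  length = c(10, 20, 30),
  depth  = c(1, 3, 2),
  mass   = c(100, 150, 200)
)

by_name   <- std_column_01(toy, c(length, mass))     # hand-picked columns
by_type   <- std_column_01(toy, where(is.numeric))   # every numeric column
by_prefix <- std_column_01(toy, starts_with("m"))    # columns matched by name
by_type
```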

Missing Data

Types of Missing Data

Missing Completely at Random (MCAR)
- No difference between missing and observed values.
- Missing observations are a random subset of all observations.
Missing at Random (MAR)
- Systematic difference between missing and observed values, but can be entirely explained by other observed variables.
Missing Not at Random (MNAR)
- Missingness is directly related to the unobserved value.

Types of Missing Data

Consider a study of depression.

Missing Completely at Random (MCAR)
- Some subjects have missing lab values because a batch of samples was processed improperly.
Missing at Random (MAR)
- Subjects who identify as men are less likely to complete a survey on depression severity.
Missing Not at Random (MNAR)
- Subjects with more severe depression are less likely to complete a survey on depression severity.
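
A small simulation can make the three mechanisms concrete. Everything below is invented for illustration: the 0-1 severity scale, the assumption that severity runs a little higher among men, and the missingness probabilities.

```r
set.seed(2025)
n <- 10000

# Hypothetical population (assumed, not from any real study)
gender   <- rep(c("man", "woman"), each = n / 2)
severity <- 0.8 * runif(n) + ifelse(gender == "man", 0.2, 0)

# Replace each value by NA with probability p
make_missing <- function(x, p) {
  ifelse(runif(length(x)) < p, NA_real_, x)
}

study <- data.frame(
  gender = gender,
  truth  = severity,
  mcar   = make_missing(severity, 0.3),                                # same chance for everyone
  mar    = make_missing(severity, ifelse(gender == "man", 0.6, 0.1)),  # depends on observed gender
  mnar   = make_missing(severity, 0.8 * severity)                      # depends on the value itself
)

# Means of the values that remain: mcar stays close to truth,
# while mar and mnar come out lower
overall <- colMeans(study[, c("truth", "mcar", "mar", "mnar")], na.rm = TRUE)
round(overall, 2)

# Under MAR, the gap closes once we compare within gender
truth_by_gender <- tapply(study$truth, study$gender, mean)
mar_by_gender   <- tapply(study$mar, study$gender, mean, na.rm = TRUE)
round(rbind(truth_by_gender, mar_by_gender), 2)
```

Under MNAR no observed variable explains the gap, which is why it is the hardest case to correct.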

When we remove missing data…

We implicitly assume observations are missing completely at random!

We might be mostly removing data from subjects who identify as men.
We might be mostly removing data from subjects with severe depression.
We are inadvertently making our data less representative.

We need to take more care when dealing with missing values!

Dealing with Missing Data

Look for patterns!
- Do observations with missing values have similar traits?
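
One way to start looking for patterns is to count missing values per column and then compare the share of missing values across groups. The `survey` tibble below is made up for illustration.

```r
library(dplyr)

# Hypothetical survey data
survey <- tibble(
  gender   = c("man", "man", "man", "woman", "woman", "woman"),
  age      = c(25, 40, NA, 31, 52, 47),
  severity = c(NA, NA, 7, 4, NA, 6)
)

# How many values are missing in each column?
na_counts <- survey |>
  summarise(across(.cols = everything(), .fns = ~ sum(is.na(.x))))
na_counts

# Does the share of missing severity scores differ by gender?
na_by_gender <- survey |>
  group_by(gender) |>
  summarise(n = n(), prop_missing = mean(is.na(severity)))
na_by_gender
```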

Consider outside explanations!
- Why might missing data exist?
- Should we have a “missing” category in our analysis?

Can we impute values?
- If depression is MCAR within gender, age, and education level, then the distribution of depression will be similar for people of the same gender, age, and education level.
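
As one very simple illustration of this idea, missing scores can be filled in with the mean of the same group. This is a sketch on made-up data, not a recommended method: it is only reasonable if the data really are MCAR within groups, and mean imputation understates the variability of the imputed values.

```r
library(dplyr)

# Hypothetical survey data
survey <- tibble(
  gender   = c("man", "man", "man", "woman", "woman", "woman"),
  severity = c(8, NA, 6, 3, 5, NA)
)

imputed <- survey |>
  group_by(gender) |>
  mutate(
    # Record which values were imputed before overwriting them
    was_missing = is.na(severity),
    # Fill each gap with the mean of the observed values in the same group
    severity = if_else(is.na(severity), mean(severity, na.rm = TRUE), severity)
  ) |>
  ungroup()

imputed
```

Keeping the `was_missing` flag makes it possible to check later how much the imputed values drive a result.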

Lab 7: Functions + Fish

To do…

Final Project Group Contract
- Due tomorrow, Friday 5/16, at 11:59pm.
Lab 7: Functions + Fish
- Due Monday 5/19 at 11:59pm.
Read Chapter 8: Iteration and Simulation
- Check-in 8.1 & 8.2 due Tuesday 5/20 before class.
