Using stringr to Work with Strings

Monday, April 27

Today we will…

  • Group Quiz 4
  • Different schedule this week & next
  • New material
    • String variables
    • Regular expressions
    • Functions for working with strings
  • PA 5.1: Scrambled Message

Follow along

Remember to download, save, and open up the handout for today!

Week 5 Layout

  • Today: Strings with stringr
    • Practice Activity: Decoding a Message
  • Wednesday: Dates with lubridate
    • Practice Activity: Jewel Heist
    • Discuss midterm exam
  • Lab Assignment: Solving a Murder Mystery
    • Using dplyr + stringr + lubridate
    • Due Sunday
    • May only use 1 late day so that I can post solutions before the exam

Week 6 Layout

  • Monday: Version control with git and GitHub
    • There are two “Course Setup” assignments to complete before class
    • Practice Activity - done in groups in class
    • Last day to submit Lab 5
  • Wednesday: Midterm Exam

String Variables

What is a string?

A string is a sequence of characters, written inside quotes.

There is a difference between…

…a string (many characters, one object)…

and

…a character vector (vector of strings).

my_string <- "Hi, my name is Bond!"
my_string
[1] "Hi, my name is Bond!"
my_vector <- c("Hi", "my", "name", "is", "Bond")
my_vector
[1] "Hi"   "my"   "name" "is"   "Bond"
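
One way to see the difference is to compare length(), which counts elements, with str_length(), which counts characters. A minimal sketch using the two objects above (the names n_string, chars_string, etc. are only for illustration):

```r
library(stringr)

my_string <- "Hi, my name is Bond!"
my_vector <- c("Hi", "my", "name", "is", "Bond")

# length() counts elements: one string vs. five strings
n_string <- length(my_string)   # 1
n_vector <- length(my_vector)   # 5

# str_length() counts the characters inside each element
chars_string <- str_length(my_string)   # 20
chars_vector <- str_length(my_vector)   # 2 2 4 2 4
```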

Strings in a Data Frame

For the colleges dataset from PA 3:

a string is:

colleges$INSTNM[214]

a character vector is:

colleges$INSTNM 

stringr

Common tasks

  • Identify strings containing a particular pattern.
  • Remove or replace a pattern.
  • Edit a string (e.g., make it lowercase).

Note

  • The stringr package loads with tidyverse.
  • All functions are of the form str_xxx().

string =

  • None of the stringr functions have a .data = argument!
  • These functions accept only a character vector as input, supplied through the string = argument.
str_detect(string = colleges$INSTNM, 
           pattern = "Polytechnic")
  • Thus, functions will need to be combined with functions from dplyr to work with a dataset!

pattern =

The pattern argument appears in many stringr functions.

  • The pattern must be supplied inside quotes.
str_detect(string = colleges$INSTNM, 
           pattern = "Polytechnic")

str_remove(string = colleges$INSTNM, 
           pattern = "(University|College)")

str_replace(string = colleges$INSTNM, 
            pattern = "^u", 
            replacement = "U")


Let’s talk more about what some of these symbols mean.

Regular Expressions

Regular Expressions

…are tricky!

  • There are lots of new symbols to keep straight.
  • There are a lot of cases to think through.


We’re going to focus on:

  • special characters
  • anchors
  • quantifiers
  • character classes
  • groups

Note: str_subset()

Returns only the elements of the original character vector in which the pattern was found anywhere in the element.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")

str_subset(my_vector, pattern = "Bond")
[1] "Bond"       "James Bond"

Special Characters

A set of characters has a special meaning in regex.

  • The stringr package does not read these as normal characters.
  • These characters are:

. ^ $ \ | * + ? { } [ ] ( )

Wild Card Character: .

This character matches any single character (except a newline).

x <- c("She", 
       "sells", 
       "seashells", 
       "by", 
       "the", 
       "seashore!")

str_subset(x, pattern = ".ells")
[1] "sells"     "seashells"
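
A small sketch of what "any character" means in practice: the . must match exactly one character, so there has to be something in front of "ells". The words vector is made up for illustration.

```r
library(stringr)

words <- c("ells", "sells", "shells", "bells!")

# "." needs one character to consume, so "ells" on its own does not match
has_match <- str_detect(words, pattern = ".ells")
# FALSE TRUE TRUE TRUE

# The match can start anywhere inside the string
matched_text <- str_extract(words, pattern = ".ells")
# NA "sells" "hells" "bells"
```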

Escape: \\

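To match one of the special characters from the previous slide literally, you need to “escape” it with \\. The regex escape is a single \, but a backslash inside an R string must itself be escaped, so you type two.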

Use \\ to escape the . – it is now read as a normal character.

str_subset(colleges$INSTNM, pattern = "\\.")
[1] "J. F. Drake State Community and Technical College"
[2] "John F. Kennedy University"                       
[3] "Pinellas Technical College-St. Petersburg"        
[4] "St. Thomas University"                            
[5] "First Institute of Travel Inc."                   
[6] "St. John's College-Department of Nursing"         
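
To see why the escape matters, compare the unescaped and escaped patterns on two names. This is a sketch on a made-up two-element vector; fixed() is stringr's helper for matching text exactly as typed, shown here as an alternative to escaping.

```r
library(stringr)

schools <- c("St. Thomas University", "Stanford University")

# Unescaped "." is a wild card, so every non-empty string matches
wild <- str_detect(schools, pattern = ".")             # TRUE TRUE

# "\\." matches a literal period only
literal <- str_detect(schools, pattern = "\\.")        # TRUE FALSE

# fixed() turns regex off and compares the characters as written
as_typed <- str_detect(schools, pattern = fixed("."))  # TRUE FALSE
```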

Anchor Characters: ^ $

^ – looks at the beginning of a string.

str_subset(colleges$INSTNM, 
           pattern = "^California State")

$ – looks at the end of a string.

str_subset(colleges$INSTNM, 
           pattern = "State University$")
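
The anchors can be tried out on a few names without the full dataset. A sketch, assuming a small hand-typed vector in place of colleges$INSTNM:

```r
library(stringr)

# Hand-typed names for illustration only
schools <- c("California State University",
             "State University of New York",
             "Ohio State")

# No anchor: "State" may appear anywhere
anywhere <- str_subset(schools, pattern = "State")

# ^ pins the match to the start of the string
starts <- str_subset(schools, pattern = "^California State")

# $ pins the match to the end of the string
ends <- str_subset(schools, pattern = "State University$")

# Both anchors together require the whole string to match
exact <- str_subset(schools, pattern = "^Ohio State$")
```

Here anywhere keeps all three names, while starts and ends each keep only "California State University" and exact keeps only "Ohio State".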

Character Classes: []

  • Character classes let you specify multiple possible characters to match on.
  • Anything inside [] is treated like “or”
str_subset(colleges$INSTNM, 
           pattern = "^[AC]")

Excluding Characters

[^ ] – specifies characters not to match on

str_subset(colleges$INSTNM, 
           pattern = "[^y]$")

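Beware / Reminder: a ^ doesn’t always mean “not”. It means “not” only as the first character inside []; outside [] it anchors the start of the string.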


Starts with “U”

str_subset(colleges$INSTNM, 
           pattern = "^U")

Does not start with “U”

str_subset(colleges$INSTNM, 
           pattern = "^[^U]")
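
The two meanings of ^ are easy to mix up, so here is a sketch on a made-up vector. The last line uses the negate = TRUE argument of str_subset(), which is another way to ask for the non-matches.

```r
library(stringr)

x <- c("Utah", "Ohio", "UCLA", "yearly")

# Outside []: ^ anchors the start of the string
starts_u <- str_subset(x, pattern = "^U")          # "Utah" "UCLA"

# Inside []: ^ negates the class; the outer ^ is still an anchor
not_start_u <- str_subset(x, pattern = "^[^U]")    # "Ohio" "yearly"

# Negated class next to the end anchor: last character is not "y"
not_end_y <- str_subset(x, pattern = "[^y]$")      # "Utah" "Ohio" "UCLA"

# Alternative: keep the simple pattern and negate the result
negated <- str_subset(x, pattern = "^U", negate = TRUE)  # "Ohio" "yearly"
```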

Special Character Classes: []

[ - ] – specifies a range of characters.

  • [A-Z] or [:upper:] matches any capital letter.
  • [a-z] or [:lower:] matches any lowercase letter.
  • [A-Za-z] or [:alpha:] matches any letter. (Avoid [A-z]: that range also includes the characters [ \ ] ^ _ `.)
  • [0-9] or [:digit:] matches any digit.
  • [:punct:] matches any punctuation character.
str_subset(colleges$INSTNM, 
           pattern = "[0-9]")

str_subset(colleges$INSTNM, 
           pattern = "[:punct:]$")
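
A caution on ranges: a range follows character-code order, so [A-z] covers more than letters because a few symbols sit between Z and a. The sketch below checks this on a made-up vector; it uses the bracketed form [[:alpha:]] for the named class.

```r
library(stringr)

symbols <- c("a", "Z", "_", "^", "7")

# [A-z] also matches "_" and "^", which fall between "Z" and "a"
loose <- str_detect(symbols, pattern = "[A-z]")
# TRUE TRUE TRUE TRUE FALSE

# Two explicit ranges match letters only
strict <- str_detect(symbols, pattern = "[A-Za-z]")
# TRUE TRUE FALSE FALSE FALSE

# The named class agrees with the strict version on these inputs
named <- str_detect(symbols, pattern = "[[:alpha:]]")
# TRUE TRUE FALSE FALSE FALSE
```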

Quantifier Characters: + *

+ – occurs 1 or more times

str_subset(colleges$INSTNM, 
           pattern = "St\\.+")

* – occurs 0 or more times

str_subset(colleges$INSTNM, 
           pattern = "\\s*-\\s*")

Tip

"\\s" is a commonly used shortcut which matches any whitespace (space, tab, etc.)

Quantifier Characters: {}

{n} – occurs exactly n times

str_subset(colleges$INSTNM, 
           pattern = "[A-Z]{4}")

{n,m} – occurs between n and m times

str_subset(colleges$INSTNM, 
           pattern = "[A-Z]{4,6}")

Want at least 4? {4,}
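
The quantifiers are easier to compare side by side with str_extract(), which shows exactly what was matched. A sketch on two made-up vectors (x and codes are not from the colleges data):

```r
library(stringr)

x <- c("St Louis", "St. Paul", "St.. Odd")

# + needs at least one period; * also accepts none
one_or_more  <- str_subset(x, pattern = "St\\.+")   # "St. Paul" "St.. Odd"
zero_or_more <- str_subset(x, pattern = "St\\.*")   # all three elements

# Quantifiers are greedy: they take as many characters as they can
greedy <- str_extract(x, pattern = "St\\.+")        # NA "St." "St.."

codes <- c("MIT", "UCLA", "NYU Tandon", "CALTECH")

# {4}: exactly four capitals in a row (found anywhere in the string)
exactly_4 <- str_extract(codes, pattern = "[A-Z]{4}")     # NA "UCLA" NA "CALT"

# {4,6}: between four and six in a row
four_to_6 <- str_extract(codes, pattern = "[A-Z]{4,6}")   # NA "UCLA" NA "CALTEC"

# {4,}: four or more in a row
at_least_4 <- str_extract(codes, pattern = "[A-Z]{4,}")   # NA "UCLA" NA "CALTECH"
```

Note that "[A-Z]{4}" still detects "CALTECH": without anchors, it only asks whether four capitals appear in a row somewhere.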

Character Groups: ()

  • () creates a group: the characters inside are matched together, in order, and a quantifier after the group applies to the whole group
str_subset(colleges$INSTNM, 
           pattern = "(iss){2}")
  • We can specify “either” / “or” within a group using |.
str_subset(colleges$INSTNM, 
           pattern = "(T|t)ech")
  • We can refer back to an earlier group with an escaped number (a backreference): \\1 matches the same text that the first group matched
str_subset(colleges$INSTNM, 
           pattern = "([aeiou])\\1")
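
The three uses of groups can be checked on a short vector. This is a sketch with hand-typed names; the iss{2} line is included only to show what changes when the parentheses are left out.

```r
library(stringr)

# Hand-typed names for illustration only
schools <- c("Mississippi State", "Georgia Tech",
             "technical college", "Cooper Union", "Yale")

# Quantifier applies to the whole group: "iss" twice in a row
grouped <- str_subset(schools, pattern = "(iss){2}")   # "Mississippi State"

# Without the group, {2} applies only to the last "s": looks for "isss"
ungrouped <- str_subset(schools, pattern = "iss{2}")   # character(0)

# Alternation inside a group: "Tech" or "tech"
either <- str_subset(schools, pattern = "(T|t)ech")
# "Georgia Tech" "technical college"

# Backreference: a vowel followed by the same vowel
doubled <- str_subset(schools, pattern = "([aeiou])\\1")  # "Cooper Union"
```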

Let’s put these to use!

Detecting Patterns

str_detect()

Returns a logical vector indicating whether the pattern was found in each element of the supplied vector.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")
str_detect(my_vector, pattern = "Bond")
[1] FALSE FALSE  TRUE  TRUE
  • Pairs well with filter().
  • Works with summarise() + sum (to get total matches) or mean (to get proportion of matches).

str_detect() with filter()

Which colleges in the dataset have “Polytechnic” in their name?

colleges |> 
  filter(str_detect(INSTNM, pattern = "Polytechnic"))

How many colleges in the dataset have “Polytechnic” in their name?

colleges |> 
  summarize(
    n = sum(str_detect(INSTNM, pattern = "Polytechnic"))
  )
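
To run these pipelines without the full dataset, here is a self-contained sketch. colleges_demo is a hypothetical four-row stand-in for colleges, and the column names n_poly and prop_poly are made up for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the colleges data
colleges_demo <- tibble(
  INSTNM = c("California Polytechnic State University",
             "Rensselaer Polytechnic Institute",
             "Amridge University",
             "Alabama State University")
)

# filter() keeps the rows where str_detect() returns TRUE
poly_rows <- colleges_demo |>
  filter(str_detect(INSTNM, pattern = "Polytechnic"))

# sum() counts the TRUEs; mean() gives the proportion of TRUEs
poly_summary <- colleges_demo |>
  summarize(
    n_poly    = sum(str_detect(INSTNM, pattern = "Polytechnic")),
    prop_poly = mean(str_detect(INSTNM, pattern = "Polytechnic"))
  )
```

With this toy data, poly_rows has 2 rows, n_poly is 2, and prop_poly is 0.5.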

Replace / Remove Patterns

str_replace()

Replace the first matched pattern in each string.

str_replace(my_vector, 
            pattern = "Bond", 
            replacement = "Franco")
[1] "Hello,"       "my name is"   "Franco"       "James Franco"


Related Function

str_replace_all() replaces all matched patterns in each string.
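
The difference only shows up when a single string contains the pattern more than once. A minimal sketch:

```r
library(stringr)

line <- "Bond, James Bond"

# Only the first "Bond" in the string is replaced
first_only <- str_replace(line, pattern = "Bond", replacement = "Franco")
# "Franco, James Bond"

# Every "Bond" in the string is replaced
every_one <- str_replace_all(line, pattern = "Bond", replacement = "Franco")
# "Franco, James Franco"
```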

str_replace() with mutate()

  • Change “University” to “Uni” to pretend we live in Australia
colleges |> 
  mutate(INSTNM = str_replace(INSTNM, 
                              pattern = "University", 
                              replacement = "Uni")
         )

str_remove()

Remove the first matched pattern in each string.

colleges |> 
  mutate(INSTNM = str_remove(INSTNM, 
                             pattern = "(College|University)"
                             )
         )


Related Functions

str_remove() is a special case of str_replace(x, pattern, replacement = "").

str_remove_all() removes all matched patterns in each string.
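
Removing a word usually leaves stray whitespace behind. The sketch below shows this on a made-up vector ("Example University College" is invented) and tidies up with str_squish().

```r
library(stringr)

schools <- c("Boston College", "University of Iowa",
             "Example University College")

# Only the first match in each string is removed
removed_first <- str_remove(schools, pattern = "(College|University)")
# "Boston " " of Iowa" "Example  College"

# Every match in each string is removed
removed_all <- str_remove_all(schools, pattern = "(College|University)")
# "Boston " " of Iowa" "Example  "

# Clean up the leftover spaces
tidy <- str_squish(removed_all)
# "Boston" "of Iowa" "Example"
```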

String Lengths

str_length()

Returns the number of characters in each string.

colleges |> 
  mutate(
    name_length = str_length(INSTNM)
         ) |> 
  select(INSTNM, name_length) |> 
  head()
# A tibble: 6 × 2
  INSTNM                              name_length
  <chr>                                     <int>
1 Alabama A & M University                     24
2 University of Alabama at Birmingham          35
3 Amridge University                           18
4 University of Alabama in Huntsville          35
5 Alabama State University                     24
6 The University of Alabama                    25

Change the Length

Shorten or lengthen a string to a specified length.

str_sub(): extract part of a string based on a starting and ending position.

colleges |> 
  mutate(short_name = str_sub(INSTNM, 
                              start = 1,
                              end = 8)
         )

str_pad(): pad every string to at least a fixed length (width). Strings already longer than width are left unchanged.

colleges |> 
  mutate(long_name = str_pad(INSTNM, 
                             width = 20, 
                             pad = "_", 
                             side = "both")
         )
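
A quick sketch of both functions on single strings, which makes the edge cases visible. str_trunc() is not covered on this slide; it is included as the stringr function for capping a string at a maximum width.

```r
library(stringr)

name <- "California Polytechnic"   # 22 characters

# Positions count from 1; negative positions count from the end
first_8 <- str_sub(name, start = 1, end = 8)      # "Californ"
last_11 <- str_sub(name, start = -11, end = -1)   # "Polytechnic"

# Padding fills a short string up to the requested width
padded <- str_pad("Yale", width = 10, pad = "_", side = "both")
# "___Yale___"

# A string longer than the width is returned unchanged
not_padded <- str_pad(name, width = 10)

# To enforce a maximum width instead, truncate
truncated <- str_trunc(name, width = 10)
```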

Create new character variable

str_extract()

Returns a character vector containing the matched text, or NA where the pattern was not found.

colleges |> 
  mutate(first_word = str_extract(INSTNM,
                                  pattern = "[A-Za-z]+"))

Warning

str_extract() only returns the first pattern match.
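
The choice of quantifier and anchor changes what "first word" means when a name does not start with a letter. A sketch on a short vector ("2U Academy" is invented to show the edge case):

```r
library(stringr)

schools <- c("Alabama A & M University",
             "The University of Alabama",
             "2U Academy")

# +: the first run of one or more letters, wherever it starts
with_plus <- str_extract(schools, pattern = "[A-Za-z]+")
# "Alabama" "The" "U"

# *: zero letters also counts as a match, so the result can be empty
with_star <- str_extract(schools, pattern = "[A-Za-z]*")
# "Alabama" "The" ""

# ^ and +: letters at the very start, otherwise NA
anchored <- str_extract(schools, pattern = "^[A-Za-z]+")
# "Alabama" "The" NA
```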

Modify Characters

Edit Capitalization of Strings

Convert letters in a string to a specific capitalization format.

str_to_lower() converts all letters in a string to lowercase.

colleges |> 
  mutate(INSTNM = str_to_lower(INSTNM))

str_to_upper() converts all letters in a string to uppercase.

colleges |> 
  mutate(INSTNM = str_to_upper(INSTNM))

str_to_title() converts the first letter of each word to uppercase and the remaining letters to lowercase.

colleges |> 
  mutate(INSTNM = str_to_title(INSTNM))

Handling Whitespace

Removing Whitespace

str_trim()

Removes whitespace from the start and end of a string.

str_trim("  Hi.     there  ", 
         side = "both")
[1] "Hi.     there"

str_squish()

Trims whitespace from each end and collapses repeated whitespace inside the string into single spaces.

str_squish("  Hi.     there  ")
[1] "Hi. there"

Combining Strings

str_c()

Joins multiple character vectors, element by element, into a single character vector.

colleges |> 
  mutate(
    address = str_c(CITY, STABBR, ZIP, sep = ", ")
         )

Note

Similar to paste() and paste0(), but stricter: a missing value stays missing (NA) instead of being pasted in as the text "NA".
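
A sketch of how str_c() and paste() differ when a value is missing. The city and state vectors are made up for illustration.

```r
library(stringr)

city  <- c("San Luis Obispo", "Ames")
state <- c("CA", NA)

# str_c() keeps a missing value missing
with_str_c <- str_c(city, state, sep = ", ")
# "San Luis Obispo, CA" NA

# paste() converts NA to the text "NA"
with_paste <- paste(city, state, sep = ", ")
# "San Luis Obispo, CA" "Ames, NA"

# collapse = joins the elements of one vector into a single string
collapsed <- str_c(city, collapse = " | ")
# "San Luis Obispo | Ames"
```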

Tips for working with strings and regex

  • Use the stringr cheatsheet!!!
  • Remember that str_xxx functions need the first argument to be a vector of strings, not a dataset!
  • Read the regular expressions out loud like a request.
  • Test out your expressions on small examples first.
  • Be patient with yourself!

PA 5.1: Scrambled Message

In this activity, you will use functions from the stringr package and regex to decode a message.

[Image: a pile of tiles from the game of Scrabble.]

Collaborative Protocol

During your collaboration, your group will alternate between three roles:

Project Manager

  • Reads out the prompt and ensures the group understands what is being asked.
  • Manages resources (e.g., cheatsheets, textbook).
  • Answers Coder’s questions about syntax based on the resources.
  • Works with the group to debug the code.
  • Encourages the Coder to vocalize and explain their thinking.

Computer

  • Types the code specified by the Coder into the Quarto document.
  • Runs the code provided by the Coder.
  • Works with the group to debug the code.
  • Evaluates the output against the question prompt.

Coder

  • Confirms they understand what the prompt is asking.
  • Talks with the group about their ideas.
  • Explains their thinking.
  • Directs the Computer what to type.
  • Works with the group to debug the code.

Submission

  • When you have completed the puzzle, you will submit the movie on Canvas.
    • You can ask me if it is correct before you submit
  • You do not need to submit your code, but you should check your code against the solutions when they are posted!

Starting Roles Today

The person whose hometown is closest to SLO starts as the Project Manager; the person with the second-closest hometown starts as the Computer.

To do…

  • PA 5.1: Scrambled Message
    • Due Tuesday by 11:59 pm
  • LA 5: Murder in SQL City
    • Due Sunday at 11:59 pm
    • You can use maximum 1 late day on this lab!
  • Review exam information posted on Canvas - we will discuss on Wednesday

Appendix – More stringr verbs and practice with regex

str_match()

Returns a character matrix containing either NA or the pattern, depending on if the pattern was found.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")

str_match(my_vector, pattern = "Bond")
     [,1]  
[1,] NA    
[2,] NA    
[3,] "Bond"
[4,] "Bond"

str_locate()

Returns an integer matrix with two columns – the starting and ending location of the pattern. The values are NA if the pattern is not found.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")

str_locate(my_vector, pattern = "Bond")
     start end
[1,]    NA  NA
[2,]    NA  NA
[3,]     1   4
[4,]     7  10

Related Function

str_sub() extracts values based on a starting and ending location.

str_glue()

Use variables in the environment to create a string based on {expressions}.

first <- "James"
last <- "Bond"
str_glue("My name is {last}, {first} {last}")
My name is Bond, James Bond

Tip

For more details, I would recommend looking up the glue R package!

What do you mean by the first match for str_extract?

Suppose we had a slightly different vector…

alt_vector <- c("Hello,", 
                "my name is", 
                "Bond, James Bond")

If we were to extract every instance of "Bond" from the vector…

str_extract(alt_vector, 
            pattern = "Bond")
[1] NA     NA     "Bond"
str_extract_all(alt_vector, 
                pattern = "Bond")
[[1]]
character(0)

[[2]]
character(0)

[[3]]
[1] "Bond" "Bond"

Try it out!

my_vector <- c("I scream,", 
               "you scream", 
               "we all",
               "scream",
               "for",
               "ice cream")

str_detect(my_vector, pattern = "cream")
str_locate(my_vector, pattern = "cream")
str_match(my_vector, pattern = "cream")
str_extract(my_vector, pattern = "cream")
str_subset(my_vector, pattern = "cream")

Note

For each of these functions, write down:

  • the object structure of the output.
  • the data type of the output.
  • a brief explanation of what they do.

Try it out!

What regular expressions would match words that…

  • end with a vowel?
  • start with x, y, or z?
  • contain at least one digit?
  • contain two of the same letter in a row?
x <- c("zebra", 
       "xray", 
       "apple", 
       "yellow",
       "color", 
       "patt3rn",
       "g2g",
       "summarise")

Some Possible Solutions…

  • end with a vowel?
str_subset(x, "[aeiouy]$")
  • start with x, y, or z?
str_subset(x, "^[xyz]")
  • contain at least one digit?
str_subset(x, "[:digit:]")
  • contain two of the same letter in a row?
str_subset(x, "([:alpha:])\\1")

More practice!

I want to join two datasets that have a county variable:

county_pop

county     pop
STORY      100000
BOONE      40000
MARSHALL   120000
POLK       500000

county_loc

county     region
Story      Central
Boone      Central
Marshall   East
Polk       Central

Practice

What stringr function will help me join the county_pop and county_loc by county?
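
One possible approach, sketched under the assumption that the only mismatch between the two tables is capitalization: convert the upper-case names to title case with str_to_title() and then join. The tibbles below retype the two tables shown above; county_joined is a made-up name.

```r
library(dplyr)
library(stringr)

county_pop <- tibble(
  county = c("STORY", "BOONE", "MARSHALL", "POLK"),
  pop    = c(100000, 40000, 120000, 500000)
)

county_loc <- tibble(
  county = c("Story", "Boone", "Marshall", "Polk"),
  region = c("Central", "Central", "East", "Central")
)

# Put both key columns in the same capitalization, then join on the key
county_joined <- county_pop |>
  mutate(county = str_to_title(county)) |>
  left_join(county_loc, by = "county")
```

Converting county_loc with str_to_upper() instead would work equally well; what matters is that both keys use the same format before the join.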

More practice!

What if I want to pull out only the area code in a phone number?

phone_numbers <- c("(515)242-1958", "(507)598-1395", "(805)938-7639")

Practice

You will need a stringr function and to use regular expressions!

str_extract(phone_numbers, "\\(\\d{3}\\)")
[1] "(515)" "(507)" "(805)"

What if I want just the numbers in the area code?

str_extract(phone_numbers, "\\((\\d{3})\\)", group = 1)
[1] "515" "507" "805"
phone_numbers |> 
  str_extract(pattern = "\\(\\d{3}\\)") |> 
  str_remove_all(pattern = "[:punct:]")
[1] "515" "507" "805"

More practice! (last one)

awards_dat
awards
Beyonce: 35G, 0A, 0E
Kendrick Lamar: 22G, 0A, 1E
Charli XCX: 2G, 0A, 0E
Cynthia Erivo: 1G, 0A, 1E
Viola Davis: 1G, 1A, 1E
Elton John: 6G, 2A, 1E

That’s annoying…

Create a variable with just the artist name and a variable with the number of Grammys won.

awards_dat |> 
  mutate(artist = str_extract(awards, "[A-Za-z\\s]+"),
         grammys = str_extract(awards, "([0-9]+)G", 
                               group = 1)) |> 
  select(artist, grammys)
artist           grammys
Beyonce          35
Kendrick Lamar   22
Charli XCX       2
Cynthia Erivo    1
Viola Davis      1
Elton John       6
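
For anyone who wants to run this end to end, here is a self-contained variant. It is a sketch, not the only solution: it retypes the awards column shown above, takes everything before the colon as the artist name, and uses str_match() to pull out the captured digits so that the count can be stored as an integer. awards_clean is a made-up name.

```r
library(dplyr)
library(stringr)

awards_dat <- tibble(
  awards = c("Beyonce: 35G, 0A, 0E",
             "Kendrick Lamar: 22G, 0A, 1E",
             "Charli XCX: 2G, 0A, 0E",
             "Cynthia Erivo: 1G, 0A, 1E",
             "Viola Davis: 1G, 1A, 1E",
             "Elton John: 6G, 2A, 1E")
)

awards_clean <- awards_dat |>
  mutate(
    # One or more characters that are not a colon, from the start
    artist = str_extract(awards, pattern = "^[^:]+"),
    # Column 2 of str_match() holds the first capture group (the digits)
    grammys = as.integer(str_match(awards, pattern = "([0-9]+)G")[, 2])
  ) |>
  select(artist, grammys)
```

Using [0-9] rather than [1-9] matters for counts such as 10 or 20, where a zero sits directly before the G.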