Using `stringr` to Work with Strings

Monday, April 27

Today we will…

Group Quiz 4
Different schedule this week & next
New material
- String variables
- Regular expressions
- Functions for working with strings
PA 5.1: Scrambled Message

Follow along

Remember to download, save, and open up the handout for today!

Week 5 Layout

Today: Strings with stringr
- Practice Activity: Decoding a Message

Wednesday: Dates with lubridate
- Practice Activity: Jewel Heist
- Discuss midterm exam

Lab Assignment: Solving a Murder Mystery
- Using dplyr + stringr + lubridate
- Due Sunday
- May only use 1 late day so that I can post solutions before the exam

Week 6 Layout

Monday: Version control with git and GitHub
- There are two “Course Setup” assignments to complete before class
- Practice Activity - done in groups in class
- Last day to submit Lab 5

Wednesday: Midterm Exam

String Variables

What is a string?

A string is a bunch of characters.

There is a difference between…

…a string (many characters, one object)…

and

…a character vector (vector of strings).

my_string <- "Hi, my name is Bond!"
my_string

[1] "Hi, my name is Bond!"

my_vector <- c("Hi", "my", "name", "is", "Bond")
my_vector

[1] "Hi"   "my"   "name" "is"   "Bond"
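A quick way to see the difference is to compare `length()` with `str_length()`. This is a minimal sketch that reuses the two objects defined above: `length()` counts the elements of a vector, while `str_length()` counts the characters inside each element.

```r
library(stringr)

my_string <- "Hi, my name is Bond!"
my_vector <- c("Hi", "my", "name", "is", "Bond")

# length() counts elements of the vector, not characters
length(my_string)      # 1 element
length(my_vector)      # 5 elements

# str_length() counts the characters within each element
str_length(my_string)  # 20
str_length(my_vector)  # 2 2 4 2 4
```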

Strings in a Data Frame

For the colleges dataset from PA 3:

a string is:

colleges$INSTNM[214]

a character vector is:

colleges$INSTNM

`stringr`

Common tasks

Identify strings containing a particular pattern.
Remove or replace a pattern.
Edit a string (e.g., make it lowercase).

Note

The stringr package loads with tidyverse.
Nearly all functions are of the form str_xxx().

`string =`

None of the stringr functions have a .data = argument!
Instead, they take a character vector as input through the string = argument.

str_detect(string = colleges$INSTNM, 
           pattern = "Polytechnic")

Thus, stringr functions need to be combined with dplyr functions, such as filter() or mutate(), to work with a dataset!

`pattern =`

The pattern argument appears in many stringr functions.

The pattern must be supplied inside quotes.

str_detect(string = colleges$INSTNM, 
           pattern = "Polytechnic")

str_remove(string = colleges$INSTNM, 
           pattern = "(University|College)")

str_replace(string = colleges$INSTNM, 
            pattern = "^u", 
            replacement = "U")

Let’s talk more about what some of these symbols mean.

Regular Expressions

…are tricky!

There are lots of new symbols to keep straight.
There are a lot of cases to think through.

We’re going to focus on:

special characters
anchors
quantifiers

character classes
groups

Note: `str_subset()`

Returns the elements of the original character vector in which the pattern was found anywhere in the element.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")

str_subset(my_vector, pattern = "Bond")

[1] "Bond"       "James Bond"

Special Characters

There is a set of characters that have a specific meaning when using regex.

The stringr package does not read these as normal characters.
These characters are:

. ^ $ \ | * + ? { } [ ] ( )

Wild Card Character: `.`

This character matches any single character (by default, anything except a line break).

x <- c("She", 
       "sells", 
       "seashells", 
       "by", 
       "the", 
       "seashore!")

str_subset(x, pattern = ".ells")

[1] "sells"     "seashells"

Escape: `\\`

To match one of the special characters on the previous slide literally, you need to “escape” it with \\.

Use \\ to escape the . so that it is read as a normal character.

str_subset(colleges$INSTNM, pattern = "\\.")

[1] "J. F. Drake State Community and Technical College"
[2] "John F. Kennedy University"                       
[3] "Pinellas Technical College-St. Petersburg"        
[4] "St. Thomas University"                            
[5] "First Institute of Travel Inc."                   
[6] "St. John's College-Department of Nursing"
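To see what escaping changes, it can help to try both patterns on a tiny vector. The sketch below uses a made-up two-element vector `x` (not the colleges data) and also shows `fixed()`, a stringr helper that treats the whole pattern as literal text.

```r
library(stringr)

# Hypothetical example names, one with a period and one without
x <- c("St. Thomas University", "Ohio State University")

# Unescaped: . is the wild card, so any non-empty string matches
str_detect(x, pattern = ".")          # TRUE TRUE

# Escaped: \\. matches only a literal period
str_detect(x, pattern = "\\.")        # TRUE FALSE

# fixed() reads the pattern as plain text, so no escaping is needed
str_detect(x, pattern = fixed("."))   # TRUE FALSE
```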

Anchor Characters: `^ $`

^ – looks at the beginning of a string.

str_subset(colleges$INSTNM, 
           pattern = "^California State")

$ – looks at the end of a string.

str_subset(colleges$INSTNM, 
           pattern = "State University$")
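Here is a small sketch of both anchors on a made-up vector of names (the vector `x` is only for illustration, it is not the colleges data). Notice that the third name contains "State University" but not at the end, so `$` rules it out.

```r
library(stringr)

x <- c("California State University-Fresno",
       "Ohio State University",
       "State University of New York")

# ^ : the match must begin at the start of the string
str_subset(x, pattern = "^California State")
# [1] "California State University-Fresno"

# $ : the match must finish at the end of the string
str_subset(x, pattern = "State University$")
# [1] "Ohio State University"
```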

Character Classes: `[]`

Character classes let you specify multiple possible characters to match on.
Anything inside [] is treated like “or”

str_subset(colleges$INSTNM, 
           pattern = "^[AC]")

Excluding Characters

[^ ] – specifies characters not to match on

str_subset(colleges$INSTNM, 
           pattern = "[^y]$")

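Beware / Reminder: a ^ doesn’t always mean “not”. As the first character inside [] it excludes characters; outside [] it anchors the match to the start of the string.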

Starts with “U”

str_subset(colleges$INSTNM, 
           pattern = "^U")

Does not start with “U”

str_subset(colleges$INSTNM, 
           pattern = "^[^U]")
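The two roles of `^` are easier to keep apart on a toy example. This sketch uses a made-up vector `x`; the last pattern combines an excluded character with the `$` anchor.

```r
library(stringr)

x <- c("Utah", "Ohio", "Yale U")

# ^ outside []: anchor to the start of the string
str_subset(x, pattern = "^U")      # "Utah"

# ^ inside []: exclude; so this reads "starts with something other than U"
str_subset(x, pattern = "^[^U]")   # "Ohio" "Yale U"

# Excluded character at the end: "does not end with U"
str_subset(x, pattern = "[^U]$")   # "Utah" "Ohio"
```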

Special Character Classes: `[]`

[ - ] – specifies a range of characters.

[A-Z] or [:upper:] matches any capital letter.
[a-z] or [:lower:] matches any lowercase letter.
[A-Za-z] or [:alpha:] matches any letter.
[0-9] or [:digit:] matches any digit.
[:punct:] matches any punctuation character.

str_subset(colleges$INSTNM, 
           pattern = "[0-9]")

str_subset(colleges$INSTNM, 
           pattern = "[:punct:]$")
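One caution about ranges: they follow character code order, so a range written as `[A-z]` also covers the handful of symbols that sit between `Z` and `a` (such as `^` and `_`). The sketch below checks this on a made-up vector of single characters.

```r
library(stringr)

x <- c("a", "Z", "_", "^", "7")

# Two separate ranges: letters only
str_detect(x, pattern = "[A-Za-z]")   # TRUE TRUE FALSE FALSE FALSE

# One range from A to z: also picks up "_" and "^"
str_detect(x, pattern = "[A-z]")      # TRUE TRUE TRUE TRUE FALSE

# The named classes avoid the problem
str_detect(x, pattern = "[:alpha:]")
str_detect(x, pattern = "[:digit:]")
```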

Quantifier Characters: `+ *`

+ – occurs 1 or more times

str_subset(colleges$INSTNM, 
           pattern = "St\\.+")

* – occurs 0 or more times

str_subset(colleges$INSTNM, 
           pattern = "\\s*-\\s*")

Tip

"\\s" is a commonly used shortcut which matches any whitespace (space, tab, etc.)

Quantifier Characters: `{}`

{n} – occurs exactly n times

str_subset(colleges$INSTNM, 
           pattern = "[A-Z]{4}")

{n,m} – occurs between n and m times

str_subset(colleges$INSTNM, 
           pattern = "[A-Z]{4,6}")

Want at least 4? {4,}
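The quantifiers can be compared side by side on small inputs. This is a sketch with made-up vectors `x` and `y`; it uses `str_replace()` and `str_extract()` so you can see exactly which characters each quantifier consumed. Quantifiers are greedy, meaning they take as many characters as the rule allows.

```r
library(stringr)

# + and * : tidy up the spacing around a hyphen
x <- c("A-B", "A - B", "A  -  B", "AB")
str_replace(x, pattern = "\\s*-\\s*", replacement = "-")
# [1] "A-B" "A-B" "A-B" "AB"
str_detect(x, pattern = "\\s+-")   # needs at least one space before the hyphen
# [1] FALSE  TRUE  TRUE FALSE

# {n}, {n,m}, {n,} : runs of capital letters
y <- c("UCLA", "MIT", "ABCDEFG")
str_extract(y, pattern = "[A-Z]{4}")     # "UCLA" NA "ABCD"
str_extract(y, pattern = "[A-Z]{4,6}")   # "UCLA" NA "ABCDEF"
str_extract(y, pattern = "[A-Z]{4,}")    # "UCLA" NA "ABCDEFG"
```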

Character Groups: `()`

() creates a group of characters to be matched exactly

str_subset(colleges$INSTNM, 
           pattern = "(iss){2}")

We can specify “either” / “or” within a group using |.

str_subset(colleges$INSTNM, 
           pattern = "(T|t)ech")

We can refer back to a created group with an escaped number: \\1 matches the same text that the first group matched.

str_subset(colleges$INSTNM, 
           pattern = "([aeiou])\\1")
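The three uses of `()` above can be tried on a short made-up vector (the names in `x` are only for illustration):

```r
library(stringr)

x <- c("Mississippi", "Tennessee", "Ohio", "Georgia Tech", "Caltech")

# A group repeated by a quantifier: "iss" twice in a row
str_subset(x, pattern = "(iss){2}")        # "Mississippi"

# Alternatives inside a group: "Tech" or "tech"
str_subset(x, pattern = "(T|t)ech")        # "Georgia Tech" "Caltech"

# Backreference: a vowel followed by that same vowel
str_subset(x, pattern = "([aeiou])\\1")    # "Tennessee"
```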

Let’s put these to use!

Detecting Patterns

`str_detect()`

Returns a logical vector indicating whether the pattern was found in each element of the supplied vector.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")
str_detect(my_vector, pattern = "Bond")

[1] FALSE FALSE  TRUE  TRUE

Pairs well with filter().
Works with summarise() + sum (to get total matches) or mean (to get proportion of matches).

`str_detect()` with `filter()`

Which colleges in the dataset have “Polytechnic” in their name?

colleges |> 
  filter(str_detect(INSTNM, pattern = "Polytechnic"))

How many colleges in the dataset have “Polytechnic” in their name?

colleges |> 
  summarize(n = sum(
    str_detect(INSTNM, pattern = "Polytechnic")
    )
    )
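Because `TRUE` counts as 1 and `FALSE` as 0, `sum()` gives the number of matches and `mean()` gives the proportion. The sketch below uses a small made-up tibble, `toy_colleges`, in place of the real colleges data so that the result is easy to check by eye.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the colleges dataset
toy_colleges <- tibble(
  INSTNM = c("California Polytechnic State University",
             "Rensselaer Polytechnic Institute",
             "Amridge University",
             "Reed College")
)

poly_summary <- toy_colleges |>
  summarize(
    n_poly    = sum(str_detect(INSTNM, pattern = "Polytechnic")),   # count
    prop_poly = mean(str_detect(INSTNM, pattern = "Polytechnic"))   # proportion
  )

poly_summary   # n_poly = 2, prop_poly = 0.5
```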

Replace / Remove Patterns

`str_replace()`

Replace the first matched pattern in each string.

str_replace(my_vector, 
            pattern = "Bond", 
            replacement = "Franco")

[1] "Hello,"       "my name is"   "Franco"       "James Franco"

Related Function

str_replace_all() replaces all matched patterns in each string.
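The difference only shows up when a single string contains the pattern more than once. A minimal sketch with a made-up string:

```r
library(stringr)

x <- "Bond, James Bond"

# Only the first "Bond" in the string is replaced
str_replace(x, pattern = "Bond", replacement = "Franco")
# [1] "Franco, James Bond"

# Every "Bond" in the string is replaced
str_replace_all(x, pattern = "Bond", replacement = "Franco")
# [1] "Franco, James Franco"
```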

`str_replace()` with `mutate()`

Change “University” to “Uni” to pretend we live in Australia

colleges |> 
  mutate(INSTNM = str_replace(INSTNM, 
                              pattern = "University", 
                              replacement = "Uni")
         )

`str_remove()`

Remove the first matched pattern in each string.

colleges |> 
  mutate(INSTNM = str_remove(INSTNM, 
                             pattern = "(College|University)"
                             )
         )

Related Functions

str_remove() is a special case of str_replace(x, pattern, replacement = "").

str_remove_all() removes all matched patterns in each string.

String Lengths

`str_length()`

Returns the number of characters in each string.

colleges |> 
  mutate(
    name_length = str_length(INSTNM)
         ) |> 
  select(INSTNM, name_length) |> 
  head()

# A tibble: 6 × 2
  INSTNM                              name_length
  <chr>                                     <int>
1 Alabama A & M University                     24
2 University of Alabama at Birmingham          35
3 Amridge University                           18
4 University of Alabama in Huntsville          35
5 Alabama State University                     24
6 The University of Alabama                    25

Change the Length

Shorten or lengthen a string to a specified length.

str_sub()

Extract values of a string based on a starting and ending location.

colleges |> 
  mutate(short_name = str_sub(INSTNM, 
                              start = 1,
                              end = 8)
         )

str_pad()

Pad every string to a minimum length (width). Strings that are already at least that long are left unchanged.

colleges |> 
  mutate(long_name = str_pad(INSTNM, 
                             width = 20, 
                             pad = "_", 
                             side = "both")
         )
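Both functions are easier to follow on one or two strings than on a whole column. This sketch uses two made-up values; note that `str_pad()` only ever adds characters, so a string longer than `width` comes back as it was.

```r
library(stringr)

short <- "Reed College"
long  <- "University of Alabama at Birmingham"

# Keep characters 1 through 8
str_sub(c(short, long), start = 1, end = 8)
# [1] "Reed Col" "Universi"

# Negative positions count from the end of the string
str_sub(long, start = -10)
# [1] "Birmingham"

# Pad to 10 characters, splitting the padding across both sides
str_pad("Reed", width = 10, pad = "_", side = "both")
# [1] "___Reed___"

# Already longer than the requested width: returned unchanged
str_pad(long, width = 20, pad = "_", side = "both")
```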

Create new character variable

`str_extract()`

Returns a character vector containing the matched text, or NA where the pattern was not found.

colleges |> 
  mutate(first_word = str_extract(INSTNM,
                                  pattern = "[A-Za-z]*"))

Warning

str_extract() only returns the first pattern match.

Modify Characters

Edit Capitalization of Strings

Convert letters in a string to a specific capitalization format.

str_to_lower() converts all letters in a string to lowercase.

colleges |> 
  mutate(INSTNM = str_to_lower(INSTNM))

str_to_upper() converts all letters in a string to uppercase.

colleges |> 
  mutate(INSTNM = str_to_upper(INSTNM))

str_to_title() converts the first letter of each word to uppercase and the remaining letters to lowercase.

colleges |> 
  mutate(INSTNM = str_to_title(INSTNM))
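Applied to a single made-up string, the three functions give the results below. Note that `str_to_title()` also lowers the letters after the first one in each word, so an acronym like "SLO" becomes "Slo".

```r
library(stringr)

x <- "cal poly SLO"

str_to_lower(x)   # "cal poly slo"
str_to_upper(x)   # "CAL POLY SLO"
str_to_title(x)   # "Cal Poly Slo"
```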

Handling Whitespace

Removing Whitespace

str_trim()

removes whitespace from start and end of string

str_trim("  Hi.     there  ", 
         side = "both")

[1] "Hi.     there"

str_squish()

trims whitespace from each end and collapses multiple spaces into single spaces

str_squish("  Hi.     there  ")

[1] "Hi. there"

Combining Strings

`str_c()`

join multiple strings into a single character vector

colleges |> 
  mutate(
    address = str_c(CITY, STABBR, ZIP, sep = ", ")
         )

Note

Similar to paste() and paste0(), but stricter: the default separator is "" and a missing value in any input gives NA rather than the text "NA".
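The handling of missing values is where `str_c()` and `paste()` differ most visibly. A minimal sketch with made-up address pieces:

```r
library(stringr)

# Joining complete pieces works much like paste()
str_c("San Luis Obispo", "CA", "93407", sep = ", ")
# [1] "San Luis Obispo, CA, 93407"

# A missing piece makes the whole result missing...
str_c("San Luis Obispo", NA, sep = ", ")
# [1] NA

# ...whereas paste() turns NA into the text "NA"
paste("San Luis Obispo", NA, sep = ", ")
# [1] "San Luis Obispo, NA"
```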

Tips for working with strings and regex

Use the stringr cheatsheet!!!
Remember that str_xxx functions need the first argument to be a vector of strings, not a dataset!
Read the regular expressions out loud like a request.
Test out your expressions on small examples first.
Be patient with yourself!

PA 5.1: Scrambled Message

In this activity, you will use functions from the stringr package and regex to decode a message.

(Image: a pile of tiles from the game of Scrabble.)

Collaborative Protocol

During your collaboration, your group will alternate between three roles:

Project Manager
- Reads out the prompt and ensures the group understands what is being asked.
- Manages resources (e.g., cheatsheets, textbook).
- Answers Coder’s questions about syntax based on the resources.
- Works with the group to debug the code.

Computer
- Encourages the Coder to vocalize and explain their thinking.
- Types the code specified by the Coder into the Quarto document.
- Runs the code provided by the Coder.
- Works with the group to debug the code.
- Evaluates the output against the question prompt.

Coder
- Confirms they understand what the prompt is asking.
- Talks with the group about their ideas.
- Explains their thinking.
- Directs the Computer what to type.
- Works with the group to debug the code.

Submission

When you have completed the puzzle, you will submit the movie on Canvas.
- You can ask me if it is correct before you submit
You do not need to submit your code, but you should check your code against the solutions when they are posted!

Starting Roles Today

The person whose hometown is closest to SLO starts as the Project Manager; the person with the second-closest hometown starts as the Computer.

To do…

PA 5.1: Scrambled Message
- Due Tuesday by 11:59 pm
LA 5: Murder in SQL City
- Due Sunday at 11:59 pm
- You can use maximum 1 late day on this lab!
Review exam information posted on Canvas - we will discuss on Wednesday

Appendix – More stringr verbs and practice with regex

`str_match()`

Returns a character matrix. The first column holds the matched text (NA if the pattern was not found), and any additional columns hold the text matched by each group in the pattern.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")

str_match(my_vector, pattern = "Bond")

     [,1]  
[1,] NA    
[2,] NA    
[3,] "Bond"
[4,] "Bond"

`str_locate()`

Returns an integer matrix with two columns, start and end, giving the starting and ending location of the first match in each string. The values are NA if the pattern is not found.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")

str_locate(my_vector, pattern = "Bond")

     start end
[1,]    NA  NA
[2,]    NA  NA
[3,]     1   4
[4,]     7  10

Related Function

str_sub() extracts values based on a starting and ending location.
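As a sketch of how the two fit together, the locations returned by `str_locate()` can be passed to `str_sub()` to pull the matched text back out. The single string `x` here is made up for illustration.

```r
library(stringr)

x <- "James Bond"

# One row per input string, with columns "start" and "end"
loc <- str_locate(x, pattern = "Bond")
loc

# Use those positions to extract the matched piece
str_sub(x, start = loc[, "start"], end = loc[, "end"])
# [1] "Bond"
```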

`str_glue()`

Use variables in the environment to create a string based on {expressions}.

first <- "James"
last <- "Bond"
str_glue("My name is {last}, {first} {last}")

My name is Bond, James Bond

Tip

For more details, I would recommend looking up the glue R package!

What do you mean by the first match for `str_extract`?

Suppose we had a slightly different vector…

alt_vector <- c("Hello,", 
               "my name is", 
               "Bond, James Bond")

If we were to extract every instance of "Bond" from the vector…

str_extract(alt_vector, 
            pattern = "Bond")

[1] NA     NA     "Bond"

str_extract_all(alt_vector, 
                pattern = "Bond")

[[1]]
character(0)

[[2]]
character(0)

[[3]]
[1] "Bond" "Bond"

Try it out!

my_vector <- c("I scream,", 
               "you scream", 
               "we all",
               "scream",
               "for",
               "ice cream")

str_detect(my_vector, pattern = "cream")
str_locate(my_vector, pattern = "cream")
str_match(my_vector, pattern = "cream")
str_extract(my_vector, pattern = "cream")
str_subset(my_vector, pattern = "cream")

Note

For each of these functions, write down:

the object structure of the output.
the data type of the output.
a brief explanation of what they do.

Try it out!

What regular expressions would match words that…

end with a vowel?
start with x, y, or z?
contain at least one digit?
contain two of the same letter in a row?

x <- c("zebra", 
       "xray", 
       "apple", 
       "yellow",
       "color", 
       "patt3rn",
       "g2g",
       "summarise")

Some Possible Solutions…

end with a vowel?

str_subset(x, "[aeiouy]$")

start with x, y, or z?

str_subset(x, "^[xyz]")

contain at least one digit?

str_subset(x, "[:digit:]")

contain two of the same letter in a row?

str_subset(x, "([:alpha:])\\1")

More practice!

I want to join two datasets that have a county variable:

county_pop

county	pop
STORY	100000
BOONE	40000
MARSHALL	120000
POLK	500000

county_loc

county	region
Story	Central
Boone	Central
Marshall	East
Polk	Central

Practice

What stringr function will help me join the county_pop and county_loc by county?
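One possible approach, sketched under the assumption that the two tables look exactly like the ones shown above: convert the county names to a shared capitalization before joining. Here `str_to_title()` is applied to `county_pop`; applying `str_to_upper()` to `county_loc` would work just as well.

```r
library(dplyr)
library(stringr)

county_pop <- tibble(
  county = c("STORY", "BOONE", "MARSHALL", "POLK"),
  pop    = c(100000, 40000, 120000, 500000)
)

county_loc <- tibble(
  county = c("Story", "Boone", "Marshall", "Polk"),
  region = c("Central", "Central", "East", "Central")
)

# Make the key match in both tables, then join on it
joined <- county_pop |>
  mutate(county = str_to_title(county)) |>
  left_join(county_loc, by = "county")

joined
```

Without the `mutate()` step, none of the keys would match and every `region` would be NA.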

More practice!

What if I want to pull out only the area code in a phone number?

phone_numbers <- c("(515)242-1958", "(507)598-1395", "(805)938-7639")

Practice

You will need a stringr function and to use regular expressions!

str_extract(phone_numbers, "\\(\\d{3}\\)")

[1] "(515)" "(507)" "(805)"

What if I want just the numbers in the area code?

str_extract(phone_numbers, "\\((\\d{3})\\)", group = 1)

[1] "515" "507" "805"

Alternatively, extract the area code with its parentheses and then remove the punctuation:

phone_numbers |> 
  str_extract(pattern = "\\(\\d{3}\\)") |> 
  str_remove_all(pattern = "[:punct:]")

[1] "515" "507" "805"

More practice! (last one)

awards_dat

awards
Beyonce: 35G, 0A, 0E
Kendrick Lamar: 22G, 0A, 1E
Charli XCX: 2G, 0A, 0E
Cynthia Erivo: 1G, 0A, 1E
Viola Davis: 1G, 1A, 1E
Elton John: 6G, 2A, 1E

That’s annoying…

Create a variable with just the artist name and a variable with the number of Grammys won.

awards_dat |> 
  mutate(artist = str_extract(awards, "[A-Za-z\\s]+"),
         grammys = str_extract(awards, "([0-9]+)G", 
                               group = 1)) |> 
  select(artist, grammys)

artist	grammys
Beyonce	35
Kendrick Lamar	22
Charli XCX	2
Cynthia Erivo	1
Viola Davis	1
Elton John	6
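The extracted counts are still character strings, so they need converting before any arithmetic. The sketch below is one way to do that; it uses `str_match()` (the second column holds the first group) and adds a made-up third row so that a count containing a zero is covered by the `[0-9]` class.

```r
library(stringr)

awards <- c("Beyonce: 35G, 0A, 0E",
            "Elton John: 6G, 2A, 1E",
            "Hypothetical Artist: 10G, 0A, 0E")  # made-up row for illustration

# Column 1 of str_match() is the full match; column 2 is the first group
grammys_chr <- str_match(awards, pattern = "([0-9]+)G")[, 2]
grammys_chr   # "35" "6" "10"

# Convert the text to whole numbers
grammys_int <- as.integer(grammys_chr)
grammys_int   # 35 6 10
```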
