Using stringr to Work with Strings

Tuesday, April 29

Today we will…

  • Different schedule this week & next
  • New material
    • String variables
    • Functions for working with strings
    • Regular expressions
  • PA 5.1: Scrambled Message

Follow along

Remember to download, save, and open up the starter notes for this week!

Week 5 Layout

  • Today: Strings with stringr
    • Practice Activity: Decoding a Message
  • Thursday: Dates with lubridate
    • Practice Activity: Jewel Heist
    • Discuss midterm exam and project
  • Lab Assignment: Solving a Murder Mystery
    • Using dplyr + stringr + lubridate
    • Due (next) Monday
    • May only use 1 late day so that I can post solutions before the exam

Week 6 Layout

  • Tuesday: Version control with git
    • Practice Activity - done in groups in class
    • Last day to submit Lab 5
  • Thursday: Midterm Exam

String Variables

What is a string?

A string is a bunch of characters.

There is a difference between…

…a string (many characters, one object)…

and

…a character vector (vector of strings).

my_string <- "Hi, my name is Bond!"
my_string
[1] "Hi, my name is Bond!"
my_vector <- c("Hi", "my", "name", "is", "Bond")
my_vector
[1] "Hi"   "my"   "name" "is"   "Bond"

stringr

Common tasks

  • Identify strings containing a particular pattern.
  • Remove or replace a pattern.
  • Edit a string (e.g., make it lowercase).

Note

  • The stringr package loads with tidyverse.
  • All functions are of the form str_xxx().

pattern =

The pattern argument appears in many stringr functions.

  • The pattern must be supplied inside quotes.
my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")

str_detect(my_vector, pattern = "Bond")
str_locate(my_vector, pattern = "James Bond")
str_match(my_vector, pattern = "[bB]ond")
str_extract(my_vector, pattern = "[jJ]ames [bB]ond")


Let’s explore these functions!

str_detect()

Returns a logical vector indicating whether the pattern was found in each element of the supplied vector.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")
str_detect(my_vector, pattern = "Bond")
[1] FALSE FALSE  TRUE  TRUE
  • Pairs well with filter().
  • Works with summarise() + sum() (to get the total number of matches) or mean() (to get the proportion of matches).

Related Function

str_which() returns the indexes of the strings that contain a match.
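
A minimal sketch of these pairings, assuming a small made-up data frame called agents (it is not part of the course data):

```r
library(dplyr)
library(stringr)

# Hypothetical data frame, for illustration only
agents <- tibble(name = c("Hello,", "my name is", "Bond", "James Bond"))

# filter(): keep only the rows whose name contains "Bond"
bond_rows <- agents |> 
  filter(str_detect(name, pattern = "Bond"))

# summarise(): sum() counts the TRUEs, mean() gives their proportion
bond_summary <- agents |> 
  summarise(n_bond    = sum(str_detect(name, pattern = "Bond")),
            prop_bond = mean(str_detect(name, pattern = "Bond")))

# str_which(): positions of the matching elements
bond_index <- str_which(agents$name, pattern = "Bond")
```

Here bond_rows keeps 2 of the 4 rows, and bond_index holds the positions 3 and 4.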

str_match()

Returns a character matrix containing the matched text, or NA where the pattern was not found.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")

str_match(my_vector, pattern = "Bond")
     [,1]  
[1,] NA    
[2,] NA    
[3,] "Bond"
[4,] "Bond"

str_extract()

Returns a character vector containing the matched text, or NA where the pattern was not found.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")

str_extract(my_vector, pattern = "Bond")
[1] NA     NA     "Bond" "Bond"

Warning

str_extract() only returns the first pattern match.

Use str_extract_all() to return every pattern match.

What do you mean by the first match?

Suppose we had a slightly different vector…

alt_vector <- c("Hello,", 
               "my name is", 
               "Bond, James Bond")

If we were to extract every instance of "Bond" from the vector…

str_extract(alt_vector, 
            pattern = "Bond")
[1] NA     NA     "Bond"
str_extract_all(alt_vector, 
                pattern = "Bond")
[[1]]
character(0)

[[2]]
character(0)

[[3]]
[1] "Bond" "Bond"

str_locate()

Returns an integer matrix with two columns – the starting and ending location of the pattern. The values are NA if the pattern is not found.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")

str_locate(my_vector, pattern = "Bond")
     start end
[1,]    NA  NA
[2,]    NA  NA
[3,]     1   4
[4,]     7  10

Related Function

str_sub() extracts values based on a starting and ending location.
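
A small sketch of how str_locate() and str_sub() can work together. The object names (spy, loc, by_position, last_four) are made up for this example:

```r
library(stringr)

spy <- "James Bond"

# Find where "Bond" sits in the string (a one-row matrix: start, end)
loc <- str_locate(spy, pattern = "Bond")

# Pull out the same characters by position
by_position <- str_sub(spy, start = loc[, "start"], end = loc[, "end"])

# Negative positions count backwards from the end of the string
last_four <- str_sub(spy, start = -4, end = -1)
```

Both by_position and last_four are "Bond".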

str_subset()

Returns a character vector containing only the elements of the original vector in which the pattern was found.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")

str_subset(my_vector, pattern = "Bond")
[1] "Bond"       "James Bond"

Try it out!

my_vector <- c("I scream,", 
               "you scream", 
               "we all",
               "scream",
               "for",
               "ice cream")

str_detect(my_vector, pattern = "cream")
str_locate(my_vector, pattern = "cream")
str_match(my_vector, pattern = "cream")
str_extract(my_vector, pattern = "cream")
str_subset(my_vector, pattern = "cream")

Note

For each of these functions, write down:

  • the object structure of the output.
  • the data type of the output.
  • a brief explanation of what they do.

Replace / Remove Patterns

Replace the first matched pattern in each string.

  • Pairs well with mutate().
str_replace(my_vector, 
            pattern = "Bond", 
            replacement = "Franco")
[1] "Hello,"       "my name is"   "Franco"       "James Franco"


Related Function

str_replace_all() replaces all matched patterns in each string.

Remove the first matched pattern in each string.

str_remove(my_vector, 
           pattern = "Bond")
[1] "Hello,"     "my name is" ""           "James "    


Related Functions

str_remove() is a special case of str_replace(x, pattern, replacement = "").

str_remove_all() removes all matched patterns in each string.
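
To see the difference between the "first match" and "_all" versions, here is a quick sketch using a string where the pattern appears twice (the object names are only for illustration):

```r
library(stringr)

intro <- "Bond, James Bond"

# Only the first "Bond" is touched
first_replaced <- str_replace(intro, pattern = "Bond", replacement = "Franco")
first_removed  <- str_remove(intro, pattern = "Bond")

# Every "Bond" is touched
all_replaced <- str_replace_all(intro, pattern = "Bond", replacement = "Franco")
all_removed  <- str_remove_all(intro, pattern = "Bond")
```

first_replaced is "Franco, James Bond", while all_replaced is "Franco, James Franco".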

Edit Strings

Convert letters in a string to a specific capitalization format.

str_to_lower() converts all letters in a string to lowercase.


str_to_lower(my_vector)
[1] "hello,"     "my name is" "bond"       "james bond"

str_to_upper() converts all letters in a string to uppercase.


str_to_upper(my_vector)
[1] "HELLO,"     "MY NAME IS" "BOND"       "JAMES BOND"

str_to_title() converts the first letter of each word to uppercase.


str_to_title(my_vector)
[1] "Hello,"     "My Name Is" "Bond"       "James Bond"

This is handy for axis labels!

Combine Strings

Join multiple strings into a single character vector.

prompt <- "Hello, my name is"
first  <- "James"
last   <- "Bond"
str_c(prompt, last, ",", first, last, sep = " ")
[1] "Hello, my name is Bond , James Bond"

Note

Similar to paste() and paste0().
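
One difference worth knowing is how missing values are handled. A short sketch (object names are made up):

```r
library(stringr)

# paste() converts NA into the text "NA"
with_paste <- paste("Agent", NA)

# str_c() treats NA as missing, so the whole result is NA
with_str_c <- str_c("Agent", NA, sep = " ")
```

with_paste is "Agent NA", whereas with_str_c is NA.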

Combine a vector of strings into a single string.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")

str_flatten(my_vector, collapse = " ")
[1] "Hello, my name is Bond James Bond"

Use variables in the environment to create a string based on {expressions}.

first <- "James"
last <- "Bond"
str_glue("My name is {last}, {first} {last}")
My name is Bond, James Bond

Tip

For more details, I would recommend looking up the glue R package!
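
The {expressions} are not limited to variable names; any R code can go inside the braces. A minimal sketch, reusing last from above:

```r
library(stringr)

last <- "Bond"

# Each {} is evaluated as R code and its result is pasted into the string
fact <- str_glue("{str_to_upper(last)} has {str_length(last)} letters")
```

This produces the string BOND has 4 letters.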

Tips for String Success

  • Refer to the stringr cheatsheet

  • Remember that str_xxx functions need the first argument to be a vector of strings, not a dataset!

    • You will use these functions inside dplyr verbs like filter() or mutate().
cereal |> 
  mutate(is_bran = str_detect(name, "Bran"), 
         .after = name)
name is_bran manuf type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
100% Bran TRUE N cold 70 4 1 130 10.0 5.0 6 280 25 3 1.00 0.33 68.40297
100% Natural Bran TRUE Q cold 120 3 5 15 2.0 8.0 8 135 0 3 1.00 1.00 33.98368
All-Bran TRUE K cold 70 4 1 260 9.0 7.0 5 320 25 3 1.00 0.33 59.42551
All-Bran with Extra Fiber TRUE K cold 50 4 0 140 14.0 8.0 0 330 25 3 1.00 0.50 93.70491
Almond Delight FALSE R cold 110 2 2 200 1.0 14.0 8 -1 25 3 1.00 0.75 34.38484
Apple Cinnamon Cheerios FALSE G cold 110 2 2 180 1.5 10.5 10 70 25 1 1.00 0.75 29.50954
Apple Jacks FALSE K cold 110 2 0 125 1.0 11.0 14 30 25 2 1.00 1.00 33.17409
Basic 4 FALSE G cold 130 3 2 210 2.0 18.0 8 100 25 3 1.33 0.75 37.03856
Bran Chex TRUE R cold 90 2 1 200 4.0 15.0 6 125 25 1 1.00 0.67 49.12025
Bran Flakes TRUE P cold 90 3 0 210 5.0 13.0 5 190 25 3 1.00 0.67 53.31381
Cap'n'Crunch FALSE Q cold 120 1 2 220 0.0 12.0 12 35 25 2 1.00 0.75 18.04285
Cheerios FALSE G cold 110 6 2 290 2.0 17.0 1 105 25 1 1.00 1.25 50.76500
Cinnamon Toast Crunch FALSE G cold 120 1 3 210 0.0 13.0 9 45 25 2 1.00 0.75 19.82357
Clusters FALSE G cold 110 3 2 140 2.0 13.0 7 105 25 3 1.00 0.50 40.40021
Cocoa Puffs FALSE G cold 110 1 1 180 0.0 12.0 13 55 25 2 1.00 1.00 22.73645
Corn Chex FALSE R cold 110 2 0 280 0.0 22.0 3 25 25 1 1.00 1.00 41.44502
Corn Flakes FALSE K cold 100 2 0 290 1.0 21.0 2 35 25 1 1.00 1.00 45.86332
Corn Pops FALSE K cold 110 1 0 90 1.0 13.0 12 20 25 2 1.00 1.00 35.78279
Count Chocula FALSE G cold 110 1 1 180 0.0 12.0 13 65 25 2 1.00 1.00 22.39651
Cracklin' Oat Bran TRUE K cold 110 3 3 140 4.0 10.0 7 160 25 3 1.00 0.50 40.44877
Cream of Wheat (Quick) FALSE N hot 100 3 0 80 1.0 21.0 0 -1 0 2 1.00 1.00 64.53382
Crispix FALSE K cold 110 2 0 220 1.0 21.0 3 30 25 3 1.00 1.00 46.89564
Crispy Wheat & Raisins FALSE G cold 100 2 1 140 2.0 11.0 10 120 25 3 1.00 0.75 36.17620
Double Chex FALSE R cold 100 2 0 190 1.0 18.0 5 80 25 3 1.00 0.75 44.33086
Froot Loops FALSE K cold 110 2 1 125 1.0 11.0 13 30 25 2 1.00 1.00 32.20758
Frosted Flakes FALSE K cold 110 1 0 200 1.0 14.0 11 25 25 1 1.00 0.75 31.43597
Frosted Mini-Wheats FALSE K cold 100 3 0 0 3.0 14.0 7 100 25 2 1.00 0.80 58.34514
Fruit & Fibre Dates; Walnuts; and Oats FALSE P cold 120 3 2 160 5.0 12.0 10 200 25 3 1.25 0.67 40.91705
Fruitful Bran TRUE K cold 120 3 0 240 5.0 14.0 12 190 25 3 1.33 0.67 41.01549
Fruity Pebbles FALSE P cold 110 1 1 135 0.0 13.0 12 25 25 2 1.00 0.75 28.02576
Golden Crisp FALSE P cold 100 2 0 45 0.0 11.0 15 40 25 1 1.00 0.88 35.25244
Golden Grahams FALSE G cold 110 1 1 280 0.0 15.0 9 45 25 2 1.00 0.75 23.80404
Grape Nuts Flakes FALSE P cold 100 3 1 140 3.0 15.0 5 85 25 3 1.00 0.88 52.07690
Grape-Nuts FALSE P cold 110 3 0 170 3.0 17.0 3 90 25 3 1.00 0.25 53.37101
Great Grains Pecan FALSE P cold 120 3 3 75 3.0 13.0 4 100 25 3 1.00 0.33 45.81172
Honey Graham Ohs FALSE Q cold 120 1 2 220 1.0 12.0 11 45 25 2 1.00 1.00 21.87129
Honey Nut Cheerios FALSE G cold 110 3 1 250 1.5 11.5 10 90 25 1 1.00 0.75 31.07222
Honey-comb FALSE P cold 110 1 0 180 0.0 14.0 11 35 25 1 1.00 1.33 28.74241
Just Right Crunchy Nuggets FALSE K cold 110 2 1 170 1.0 17.0 6 60 100 3 1.00 1.00 36.52368
Just Right Fruit & Nut FALSE K cold 140 3 1 170 2.0 20.0 9 95 100 3 1.30 0.75 36.47151
Kix FALSE G cold 110 2 1 260 0.0 21.0 3 40 25 2 1.00 1.50 39.24111
Life FALSE Q cold 100 4 2 150 2.0 12.0 6 95 25 2 1.00 0.67 45.32807
Lucky Charms FALSE G cold 110 2 1 180 0.0 12.0 12 55 25 2 1.00 1.00 26.73451
Maypo FALSE A hot 100 4 1 0 0.0 16.0 3 95 25 2 1.00 1.00 54.85092
Muesli Raisins; Dates; & Almonds FALSE R cold 150 4 3 95 3.0 16.0 11 170 25 3 1.00 1.00 37.13686
Muesli Raisins; Peaches; & Pecans FALSE R cold 150 4 3 150 3.0 16.0 11 170 25 3 1.00 1.00 34.13976
Mueslix Crispy Blend FALSE K cold 160 3 2 150 3.0 17.0 13 160 25 3 1.50 0.67 30.31335
Multi-Grain Cheerios FALSE G cold 100 2 1 220 2.0 15.0 6 90 25 1 1.00 1.00 40.10596
Nut&Honey Crunch FALSE K cold 120 2 1 190 0.0 15.0 9 40 25 2 1.00 0.67 29.92429
Nutri-Grain Almond-Raisin FALSE K cold 140 3 2 220 3.0 21.0 7 130 25 3 1.33 0.67 40.69232
Nutri-grain Wheat FALSE K cold 90 3 0 170 3.0 18.0 2 90 25 3 1.00 1.00 59.64284
Oatmeal Raisin Crisp FALSE G cold 130 3 2 170 1.5 13.5 10 120 25 3 1.25 0.50 30.45084
Post Nat. Raisin Bran TRUE P cold 120 3 1 200 6.0 11.0 14 260 25 3 1.33 0.67 37.84059
Product 19 FALSE K cold 100 3 0 320 1.0 20.0 3 45 100 3 1.00 1.00 41.50354
Puffed Rice FALSE Q cold 50 1 0 0 0.0 13.0 0 15 0 3 0.50 1.00 60.75611
Puffed Wheat FALSE Q cold 50 2 0 0 1.0 10.0 0 50 0 3 0.50 1.00 63.00565
Quaker Oat Squares FALSE Q cold 100 4 1 135 2.0 14.0 6 110 25 3 1.00 0.50 49.51187
Quaker Oatmeal FALSE Q hot 100 5 2 0 2.7 -1.0 -1 110 0 1 1.00 0.67 50.82839
Raisin Bran TRUE K cold 120 3 1 210 5.0 14.0 12 240 25 2 1.33 0.75 39.25920
Raisin Nut Bran TRUE G cold 100 3 2 140 2.5 10.5 8 140 25 3 1.00 0.50 39.70340
Raisin Squares FALSE K cold 90 2 0 0 2.0 15.0 6 110 25 3 1.00 0.50 55.33314
Rice Chex FALSE R cold 110 1 0 240 0.0 23.0 2 30 25 1 1.00 1.13 41.99893
Rice Krispies FALSE K cold 110 2 0 290 0.0 22.0 3 35 25 1 1.00 1.00 40.56016
Shredded Wheat FALSE N cold 80 2 0 0 3.0 16.0 0 95 0 1 0.83 1.00 68.23588
Shredded Wheat 'n'Bran TRUE N cold 90 3 0 0 4.0 19.0 0 140 0 1 1.00 0.67 74.47295
Shredded Wheat spoon size FALSE N cold 90 3 0 0 3.0 20.0 0 120 0 1 1.00 0.67 72.80179
Smacks FALSE K cold 110 2 1 70 1.0 9.0 15 40 25 2 1.00 0.75 31.23005
Special K FALSE K cold 110 6 0 230 1.0 16.0 3 55 25 1 1.00 1.00 53.13132
Strawberry Fruit Wheats FALSE N cold 90 2 0 15 3.0 15.0 5 90 25 2 1.00 1.00 59.36399
Total Corn Flakes FALSE G cold 110 2 1 200 0.0 21.0 3 35 100 3 1.00 1.00 38.83975
Total Raisin Bran TRUE G cold 140 3 1 190 4.0 15.0 14 230 100 3 1.50 1.00 28.59278
Total Whole Grain FALSE G cold 100 3 1 200 3.0 16.0 3 110 100 3 1.00 1.00 46.65884
Triples FALSE G cold 110 2 1 250 0.0 21.0 3 60 25 3 1.00 0.75 39.10617
Trix FALSE G cold 110 1 1 140 0.0 13.0 12 25 25 2 1.00 1.00 27.75330
Wheat Chex FALSE R cold 100 3 1 230 3.0 17.0 3 115 25 1 1.00 0.67 49.78744
Wheaties FALSE G cold 100 3 1 200 3.0 17.0 3 110 25 1 1.00 1.00 51.59219
Wheaties Honey Gold FALSE G cold 110 2 1 200 1.0 16.0 8 60 25 1 1.00 0.75 36.18756

Tips for String Success

The real power of these str_xxx functions comes when you specify the pattern using regular expressions!

[Image: xkcd comic “Regular Expressions.” A stick figure declares “EVERYBODY STAND BACK,” then “I KNOW REGULAR EXPRESSIONS.” Regular expressions are powerful, but they can produce intricate, hard-to-maintain code if not used judiciously.]

regex

Regular Expressions

“Regexps are a very terse language that allow you to describe patterns in strings.”

R for Data Science

Use str_xxx functions + regular expressions!

str_detect(string  = my_string_vector,
           pattern = "p[ei]ck[a-z]")

Tip

You might encounter gsub(), grep(), etc. from base R, but I highly recommend using functions from the stringr package instead.

Regular Expressions

…are tricky!

  • There are lots of new symbols to keep straight.
  • There are a lot of cases to think through.


This web app for testing R regular expressions might be handy!

Special Characters

There is a set of characters that have a specific meaning when using regex.

  • The stringr package does not read these as normal characters.
  • These characters are:

. ^ $ \ | * + ? { } [ ] ( )

Wild Card Character: .

This character can match any character.

x <- c("She", 
       "sells", 
       "seashells", 
       "by", 
       "the", 
       "seashore!")

str_subset(x, pattern = ".ells")
[1] "sells"     "seashells"


str_extract(x, pattern = ".ells")
[1] NA      "sells" "hells" NA      NA      NA     


This matches strings that contain any character followed by “ells”.

Anchor Characters: ^ $

^ – looks at the beginning of a string.

x <- c("She", 
       "sells", 
       "seashells", 
       "by", 
       "the", 
       "seashore!")

str_subset(x, pattern = "^s")
[1] "sells"     "seashells" "seashore!"

This matches strings that start with “s”.

$ – looks at the end of a string.

x <- c("She", 
       "sells", 
       "seashells", 
       "by", 
       "the", 
       "seashore!")

str_subset(x, pattern = "s$")
[1] "sells"     "seashells"

This matches strings that end with “s”.

Quantifier Characters: ? + *

? – matches when the preceding character occurs 0 or 1 times in a row.

x <- c("shes", 
       "shels", 
       "shells", 
       "shellls", 
       "shelllls")

str_subset(x, pattern = "shel?s")
[1] "shes"  "shels"

+ – occurs 1 or more times in a row.

str_subset(x, pattern = "shel+s")
[1] "shels"    "shells"   "shellls"  "shelllls"

* – occurs 0 or more times in a row.

str_subset(x, pattern = "shel*s")
[1] "shes"     "shels"    "shells"   "shellls"  "shelllls"

Quantifier Characters: {}

{n} – matches when the preceding character occurs exactly n times in a row.

x <- c("shes", 
       "shels", 
       "shells", 
       "shellls", 
       "shelllls")

str_subset(x, pattern = "shel{2}s")
[1] "shells"

{n,} – occurs at least n times in a row.

str_subset(x, pattern = "shel{2,}s")
[1] "shells"   "shellls"  "shelllls"

{n,m} – occurs between n and m times in a row.

str_subset(x, pattern = "shel{1,3}s")
[1] "shels"   "shells"  "shellls"

Character Classes: []

Character classes let you specify multiple possible characters to match on.

x <- c("Peter", 
       "Piper", 
       "picked", 
       "a",
       "peck",
       "of",
       "pickled",
       "peppers!")

str_subset(x, pattern = "p[ei]ck")
[1] "picked"  "peck"    "pickled"

Matches you don’t want

[^ ] – specifies characters not to match on (think except)

str_subset(x, pattern = "p[^i]ck")
[1] "peck"


But remember that ^ outside of brackets specifies the first character in a string.

str_subset(x, pattern = "^p")
[1] "picked"   "peck"     "pickled"  "peppers!"


str_subset(x, pattern = "^[^p]")
[1] "Peter" "Piper" "a"     "of"   

Warning

Why do “Peter” and “Piper” match "^[^p]" but not "^p"?

Capitalization matters!

Character Classes: []

[ - ] – specifies a range of characters.

x <- c("Peter", 
       "Piper", 
       "picked", 
       "a",
       "peck",
       "of",
       "pickled",
       "peppers!")

str_subset(x, pattern = "p[ei]ck[a-z]")
[1] "picked"  "pickled"
  • [A-Z] matches any capital letter.
  • [a-z] matches any lowercase letter.
  • [A-Za-z] or [:alpha:] matches any letter
  • [0-9] or [:digit:] matches any number
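
A word of caution about ranges: they follow character-code order, and a few symbols sit between "Z" and "a". The sketch below (with a made-up vector called symbols) shows that [A-z] therefore matches more than letters:

```r
library(stringr)

symbols <- c("_", "^", "a", "Z")

# [A-z] spans everything from "A" to "z", including _ and ^
wide_range <- str_detect(symbols, pattern = "[A-z]")

# [A-Za-z] is limited to the two letter ranges
letters_only <- str_detect(symbols, pattern = "[A-Za-z]")
```

wide_range is TRUE for all four elements; letters_only is TRUE only for "a" and "Z".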

Shortcuts

  • \\w – matches any “word” character (\\W matches any non-“word” character)

    • “Word” characters are letters, digits, and the underscore.
  • \\d – matches any digit (\\D matches not digit)

  • \\s – matches any whitespace (\\S matches not whitespace)

    • Whitespace includes spaces, tabs, newlines, etc.


x <- "phone number: 1234567899"

str_extract(x, pattern = "\\d+")
[1] "1234567899"
str_extract_all(x, pattern = "\\S+")
[[1]]
[1] "phone"      "number:"    "1234567899"
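
Two more quick sketches of the shortcuts, using a made-up message. Note that \\w treats the underscore as a “word” character:

```r
library(stringr)

msg <- "user_42 logged in at 9:05!"

# Runs of "word" characters; punctuation and spaces split them apart
words <- str_extract_all(msg, pattern = "\\w+")[[1]]

# \\D is "not a digit", so removing it leaves only the digits
digits_only <- str_remove_all("phone number: 1234567899", pattern = "\\D")
```

words is "user_42" "logged" "in" "at" "9" "05", and digits_only is "1234567899".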

Character Groups: ()

Groups are created with ( ).

  • We can specify “either” / “or” within a group using |.
x <- c("Peter", 
       "Piper", 
       "picked", 
       "a", 
       "peck",
       "of", 
       "pickled",
       "peppers!")

str_subset(x, pattern = "p(e|i)ck")
[1] "picked"  "peck"    "pickled"


This matches strings that contain either “peck” or “pick”.

Character Groups: ()

  • We can then reference groups in order with escaped numbers (\\1) to specify that certain groupings repeat.
x <- c("hannah", 
       "had", 
       "a", 
       "ball", 
       "on",
       "a", 
       "race car")

str_subset(x, pattern = "^(.).*\\1$")
[1] "hannah"   "race car"


This matches strings that start and end with the same character.

Character Groups: ()

  • Groups also let us be very precise with extracting strings!
shopping_list <- c("apples x4", 
                   "bag of flour", 
                   "bag of sugar", 
                   "milk x2")

str_extract(shopping_list, "([a-z]+) x([1-9])")
[1] "apples x4" NA          NA          "milk x2"  


str_extract(shopping_list, "([a-z]+) x([1-9])", group = 1)
[1] "apples" NA       NA       "milk"  


str_extract(shopping_list, "([a-z]+) x([1-9])", group = 2)
[1] "4" NA  NA  "2"
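
Groups are also what fill the extra columns of str_match(). A sketch with the same shopping list (item_matrix is just an illustrative name): column 1 holds the full match, and each later column holds one group.

```r
library(stringr)

shopping_list <- c("apples x4", 
                   "bag of flour", 
                   "bag of sugar", 
                   "milk x2")

# One row per string; columns are: full match, group 1, group 2
item_matrix <- str_match(shopping_list, pattern = "([a-z]+) x([1-9])")

items      <- item_matrix[, 2]   # text captured by the first ( )
quantities <- item_matrix[, 3]   # text captured by the second ( )
```

This returns the same pieces as the group = argument of str_extract(), all in one matrix.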

Try it out!

What regular expressions would match words that…

  • end with a vowel?
  • start with x, y, or z?
  • contain at least one digit?
  • contain two of the same letter in a row?
x <- c("zebra", 
       "xray", 
       "apple", 
       "yellow",
       "color", 
       "patt3rn",
       "g2g",
       "summarise")

Some Possible Solutions…

  • end with a vowel?
str_subset(x, "[aeiouy]$")
  • start with x, y, or z?
str_subset(x, "^[xyz]")
  • contain at least one digit?
str_subset(x, "[:digit:]")
  • contain two of the same letter in a row?
str_subset(x, "([:alpha:])\\1")

Escape: \\

To match a special character, you need to escape it.

x <- c("How",
       "much", 
       "wood",
       "could",
       "a",
       "woodchuck",
       "chuck",
       "if",
       "a",
       "woodchuck",
       "could",
       "chuck",
       "wood?")

str_subset(x, pattern = "?")
Error in stri_subset_regex(string, pattern, omit_na = TRUE, negate = negate, : Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=`?`)

Escape: \\

Use \\ to escape the ? – it is now read as a normal character.

str_subset(x, pattern = "\\?")
[1] "wood?"


Note

Alternatively, you could use []:

str_subset(x, pattern = "[?]")
[1] "wood?"
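
The same idea applies to every special character, not just ?. A short sketch with a made-up vector, which also shows stringr's fixed() helper for matching text literally (no regex at all):

```r
library(stringr)

values <- c("3.14", "3914", "wood?")

# Unescaped . is a wild card, so "3914" matches too
unescaped <- str_detect(values, pattern = "3.14")

# Escaped \\. matches only a literal period
escaped <- str_detect(values, pattern = "3\\.14")

# fixed() treats the whole pattern as plain text
literal <- str_detect(values, pattern = fixed("?"))
```

unescaped is TRUE TRUE FALSE, escaped is TRUE FALSE FALSE, and literal is FALSE FALSE TRUE.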

When in Doubt


Use the web app to test R regular expressions.

Tips for working with regex

  • Read the regular expressions out loud like a request.
  • Test out your expressions on small examples first.

str_view()

str_view(c("shes", "shels", "shells", "shellls", "shelllls"), "l+")
[2] │ she<l>s
[3] │ she<ll>s
[4] │ she<lll>s
[5] │ she<llll>s
  • Be kind to yourself!

More practice!

I want to join two datasets that have a county variable:

county_pop
county pop
STORY 100000
BOONE 40000
MARSHALL 120000
POLK 500000
county_loc
county region
Story Central
Boone Central
Marshall East
Polk Central

Practice

What stringr function will help me join the county_pop and county_loc by county?
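
One possible approach, sketched under the assumption that the two datasets are tibbles built exactly as shown above: convert one county column so the capitalization agrees, then join.

```r
library(dplyr)
library(stringr)

# Recreate the two small datasets from the slide
county_pop <- tibble(county = c("STORY", "BOONE", "MARSHALL", "POLK"),
                     pop    = c(100000, 40000, 120000, 500000))
county_loc <- tibble(county = c("Story", "Boone", "Marshall", "Polk"),
                     region = c("Central", "Central", "East", "Central"))

# Make the capitalization match, then join on county
county_joined <- county_pop |> 
  mutate(county = str_to_title(county)) |> 
  left_join(county_loc, by = "county")
```

Going the other way, applying str_to_upper() to county_loc, would work just as well.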

More practice!

What if I want to pull out only the area code in a phone number?

phone_numbers <- c("(515)242-1958", "(507)598-1395", "(805)938-7639")

Practice

You will need a stringr function and to use regular expressions!

str_extract(phone_numbers, "\\(\\d{3}\\)")
[1] "(515)" "(507)" "(805)"

What if I want just the numbers in the area code?

str_extract(phone_numbers, "\\((\\d{3})\\)", group = 1)
[1] "515" "507" "805"
phone_numbers |> 
  str_extract(pattern = "\\(\\d{3}\\)") |> 
  str_remove_all(pattern = "[:punct:]")
[1] "515" "507" "805"

More practice! (last one)

awards_dat
awards
Beyonce: 35G, 0A, 0E
Kendrick Lamar: 22G, 0A, 1E
Charli XCX: 2G, 0A, 0E
Cynthia Erivo: 1G, 0A, 1E
Viola Davis: 1G, 1A, 1E
Elton John: 6G, 2A, 1E

That’s annoying…

Create a variable with just the artist name and a variable with the number of Grammys won.

awards_dat |> 
  mutate(artist = str_extract(awards, "[A-Za-z\\s]+"),
         grammies = str_extract(awards, "(\\d+)G", 
                                group = 1)) |> 
  select(artist, grammies)
artist grammies
Beyonce 35
Kendrick Lamar 22
Charli XCX 2
Cynthia Erivo 1
Viola Davis 1
Elton John 6

PA 5.1: Scrambled Message

In this activity, you will use functions from the stringr package and regex to decode a message.

A pile of tiles from the game of Scrabble.

To do…

  • PA 5.1: Scrambled Message
    • Due Thursday before class
  • LA 5: Murder in SQL City
    • Due Monday at 11:59 pm
    • You can use a maximum of 1 late day on this lab!
  • Look out for exam information posted on Canvas - we will discuss on Thursday