Week 5, Part 1: Working with Strings Using stringr

This week is all about special data types in R. Similar to the tools you learned last week for working with factors, this week you are going to learn about tools for working with strings and dates.

1 Learning Objectives

  • hypothesize what type of output a str_XXX() function will provide (e.g., character vector, logical vector, matrix, numeric vector)
  • recognize what str_XXX() function is best suited for a particular problem (e.g., replacing, detecting a pattern, removing whitespace)
  • construct simple regular expressions using character classes ([]), repeated patterns ({2,}), anchoring (^ and $), and groups (())

📖 Readings: 60-75 min

📽 Optional Videos: 10 min


Important

I provide the stringr cheatsheet for your table in class, but I would strongly recommend you also print it for yourself to use outside of class!

2 Working with Strings

📖 Required Reading: R4DS 14.1-14.5: Strings

  • Note: don’t read past 14.5! You could also skip 14.3 and come back to it.

📽 Optional Video: Strings

Note: Common stringr functions and what they do
Across every function in the stringr package, x is the string (or vector of strings) and pattern is a pattern to be found within the string.

| Task | stringr | Output |
| --- | --- | --- |
| Find a pattern and replace it | str_replace(x, pattern, replacement) and str_replace_all(x, pattern, replacement) | Modified string or character vector |
| Convert a string from upper case to lower case or vice versa | str_to_lower(x), str_to_upper(x), str_to_title(x) | Modified string or character vector |
| Strip whitespace from the start / end of a string (str_squish() also collapses repeated internal whitespace) | str_trim(x), str_squish(x) | Modified string or character vector |
| Detect if the string contains a pattern | str_detect(x, pattern) | Logical vector |
| Count how many times a pattern appears in the string | str_count(x, pattern) | Numeric (integer) vector |
| Find the first appearance of the pattern within the string | str_locate(x, pattern) | Integer matrix (start position, end position) |
| Find all appearances of the pattern within the string | str_locate_all(x, pattern) | List of integer matrices (start position, end position), one per string |
| Detect if a string contains a pattern at the start / end | str_starts(x, pattern), str_ends(x, pattern) | Logical vector |
| Subset a string from index a to b | str_sub(x, a, b) | Modified string or character vector |
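
To make the output types in the table concrete, here is a minimal sketch. The desserts vector is a made-up example (it is not from the readings); the functions are the ones listed above.

```r
library(stringr)

# Hypothetical example vector (not from the course materials)
desserts <- c("  Apple pie ", "banana split", "cherry  tart")

detected <- str_detect(desserts, "an")      # logical vector: FALSE TRUE FALSE
counts   <- str_count(desserts, "p")        # integer vector: 3 1 0
squished <- str_squish(desserts)            # character vector, extra spaces removed
first_a  <- str_locate(desserts, "a")       # integer matrix with start and end columns
all_a    <- str_locate_all(desserts, "a")   # list with one integer matrix per string

# Check the type of each result
class(detected)
class(counts)
class(squished)
class(first_a)
class(all_a)
```

Checking class() on the result of an unfamiliar str_XXX() function is a quick way to test your guess about what it returns.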

3 Regular Expressions

Regular expressions (or regex) are a concise and powerful language for describing patterns in strings. We are going to cover only relatively simple regular expressions in this class, but you could build a regular expression to match extremely complex string patterns! I think of it as a fun puzzle. The stringr cheatsheet has a helpful regex reference. You can also use this web app to test regular expressions for R.

Important

Regular expressions in R can be slightly different from Python or other languages! Mainly, because a pattern is written inside an R string, you usually need more escapes than in other languages: a single regex backslash is typed as "\\".

📖 Required Reading: R4DS 15.1-15.4: Regular Expressions

Some basics:

  • [] enclose sets of characters

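    • For example, [abc] will match any single character a, b, or c
  • - specifies a range of characters

    • For example, [A-Za-z] will match any upper or lower case letter. The shorter [A-z] also matches them, but it additionally matches the six characters that sit between Z and a in ASCII ([, \, ], ^, _, and `).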
  • . matches any character (except a newline)

  • To match special characters, you need to escape them using a \\ in R.

    • For example, \\. will match a literal ., \\$ will match a literal $.
library(stringr)

num_string <- "phone: 123-456-7890, nuid: 12345678, ssn: 123-45-6789"

str_extract(num_string, 
            pattern = "[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]")
[1] "123-45-6789"
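
The escaping rule above can be sketched with a small example. The price string is hypothetical and only there to show the difference between an unescaped . and an escaped \\.

```r
library(stringr)

# Hypothetical example string (not from the course materials)
price <- "Total: $4.50 (was $5x50)"

# Unescaped . matches any character, so both "4.50" and "5x50" are found
any_char <- str_extract_all(price, "[0-9].[0-9][0-9]")
any_char

# Escaped \\. matches only a literal period, so only "4.50" is found
literal_dot <- str_extract_all(price, "[0-9]\\.[0-9][0-9]")
literal_dot

# Escaped \\$ matches a literal dollar sign followed by a digit
dollar <- str_extract(price, "\\$[0-9]")
dollar
```

Without the escape, $ would be read as the end-of-string anchor rather than a dollar sign.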

Repeating Patterns

Listing out all of those numbers can get repetitive, though. How do we specify repetition?

  • * means 0 or more times
  • + means 1 or more times
  • ? means 0 or 1 times; most useful when you’re looking for something optional
  • {a,b} means repeat between a and b times, where a and b are integers (write it without a space after the comma).
    • Note that b can be blank. So [abc]{3,} will match abc, aaaa, cbbaa, but not ab, bb, or a.
  • {a} specifies an exact number of repetitions.
    • For example, {3} means “exactly 3 times” whereas {3,} means “3 or more times.”
num_string
[1] "phone: 123-456-7890, nuid: 12345678, ssn: 123-45-6789"
# Matches a sequence of *three* digits, followed by a dash, 
# then a sequence of *two* digits, followed by a dash, 
# then a sequence of *four* digits.
str_extract(num_string, pattern = "[0-9]{3}-[0-9]{2}-[0-9]{4}")
[1] "123-45-6789"
# Matches a sequence of *three* digits, followed by any character, 
# then a sequence of *three* digits, followed by any character, 
# then a sequence of *four* digits.
str_extract(num_string, pattern = "[0-9]{3}.[0-9]{3}.[0-9]{4}")
[1] "123-456-7890"
# Matches a sequence of at least *eight* numbers 
str_extract(num_string, pattern = "[0-9]{8,}")
[1] "12345678"
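
As a quick check of the repetition operators, the sketch below runs str_detect() on a few short test strings. The candidates vector reuses the strings from the [abc]{3,} bullet; the shapes vector is made up for illustration.

```r
library(stringr)

# Strings from the {3,} bullet above
candidates <- c("abc", "aaaa", "cbbaa", "ab", "bb", "a")
three_plus <- str_detect(candidates, "[abc]{3,}")
three_plus   # TRUE TRUE TRUE FALSE FALSE FALSE

# Hypothetical strings: an x, then zero, one, or two a's, then a y
shapes <- c("xy", "xay", "xaay")
star     <- str_detect(shapes, "xa*y")   # 0 or more a's: TRUE TRUE TRUE
plus     <- str_detect(shapes, "xa+y")   # 1 or more a's: FALSE TRUE TRUE
optional <- str_detect(shapes, "xa?y")   # 0 or 1 a:      TRUE TRUE FALSE
```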

Anchoring

There are also ways to “anchor” a pattern to a part of the string (e.g., the beginning or the end).

  • ^ has multiple meanings:
    • If ^ is the first character in a pattern (e.g., ^Al) it matches the beginning of a string.
    • If ^ follows a [ (e.g., [^abc]) then it means “not.” So, [^abc] means “any single character that is not a, b, or c.”
  • $ means the end of a string (e.g., bold$)
address <- "1600 Pennsylvania Ave NW, Washington D.C., 20500"

Grabbing the house number

# Match a sequence of one or more digits at the beginning of the string
house_num <- str_extract(address, pattern = "^[0-9]{1,}")
house_num
[1] "1600"

Grabbing the street

# Match the first run of letters, digits, and spaces (this stops at the first comma)
street <- str_extract(address, pattern = "[A-Za-z0-9 ]{1,}")
street
[1] "1600 Pennsylvania Ave NW"
# Remove house number from street address
street <- str_remove(street, house_num) |> 
  # Trim any leading or trailing whitespace from remaining string
  str_trim() 
street
[1] "Pennsylvania Ave NW"

Grabbing the city

# Match one or more characters between the two commas  
city <- str_extract(address, pattern = ",.{1,},") |> 
  # Remove the leading and trailing commas
  str_remove_all(",") |> 
  # Trim any leading or trailing whitespace from remaining string
  str_trim()
city
[1] "Washington D.C."

Grabbing the zip code

# Matches a run of 5 to 10 digits at the end of the string, which covers
# 5-digit zip codes and 9-digit zip codes written without a hyphen
zip <- str_extract(address, pattern = "[0-9]{5,10}$") 
zip
[1] "20500"
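
The two meanings of ^ are easy to mix up, so here is a small sketch that puts them side by side. The states vector is a hypothetical example; str_subset() keeps only the strings that match.

```r
library(stringr)

# Hypothetical example vector (not from the course materials)
states <- c("Alabama", "Alaska", "Kansas", "Texas", "Utah")

# ^ at the start of the pattern: strings that begin with "Al"
starts_al <- str_subset(states, "^Al")
starts_al   # "Alabama" "Alaska"

# $ at the end of the pattern: strings that end with "as"
ends_as <- str_subset(states, "as$")
ends_as     # "Kansas" "Texas"

# ^ inside []: strings whose first character is NOT A or K
not_a_or_k <- str_subset(states, "^[^AK]")
not_a_or_k  # "Texas" "Utah"
```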

Making Groups

  • () are used to capture information
    • For example, ([0-9]{4}) captures any 4-digit number
  • an | means “or”
    • For example, a|b will match a or b.
  • You can refer back to captured groups to require that something repeats, or to extract only the part of the match that a group captured.

Making a group of characters

fruit
 [1] "apple"             "apricot"           "avocado"          
 [4] "banana"            "bell pepper"       "bilberry"         
 [7] "blackberry"        "blackcurrant"      "blood orange"     
[10] "blueberry"         "boysenberry"       "breadfruit"       
[13] "canary melon"      "cantaloupe"        "cherimoya"        
[16] "cherry"            "chili pepper"      "clementine"       
[19] "cloudberry"        "coconut"           "cranberry"        
[22] "cucumber"          "currant"           "damson"           
[25] "date"              "dragonfruit"       "durian"           
[28] "eggplant"          "elderberry"        "feijoa"           
[31] "fig"               "goji berry"        "gooseberry"       
[34] "grape"             "grapefruit"        "guava"            
[37] "honeydew"          "huckleberry"       "jackfruit"        
[40] "jambul"            "jujube"            "kiwi fruit"       
[43] "kumquat"           "lemon"             "lime"             
[46] "loquat"            "lychee"            "mandarine"        
[49] "mango"             "mulberry"          "nectarine"        
[52] "nut"               "olive"             "orange"           
[55] "pamelo"            "papaya"            "passionfruit"     
[58] "peach"             "pear"              "persimmon"        
[61] "physalis"          "pineapple"         "plum"             
[64] "pomegranate"       "pomelo"            "purple mangosteen"
[67] "quince"            "raisin"            "rambutan"         
[70] "raspberry"         "redcurrant"        "rock melon"       
[73] "salal berry"       "satsuma"           "star fruit"       
[76] "strawberry"        "tamarillo"         "tangerine"        
[79] "ugli fruit"        "watermelon"       
# Searches for fruits containing a group of two p's (need to be together)
str_view(fruit, pattern = "(pp)")
 [1] │ a<pp>le
 [5] │ bell pe<pp>er
[17] │ chili pe<pp>er
[62] │ pinea<pp>le

Using an “or”

# Searches for fruits containing apple OR melon OR nut
str_view(fruit, pattern = "apple|melon|nut")
 [1] │ <apple>
[13] │ canary <melon>
[20] │ coco<nut>
[52] │ <nut>
[62] │ pine<apple>
[72] │ rock <melon>
[80] │ water<melon>
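
Parentheses also control how far an | reaches. The sketch below uses a made-up spellings vector to compare a grouped “or” with an ungrouped one.

```r
library(stringr)

# Hypothetical example vector (not from the course materials)
spellings <- c("gray", "grey", "graph", "key")

# Grouped: "gr", then a OR e, then "y"
grouped <- str_detect(spellings, "gr(a|e)y")
grouped     # TRUE TRUE FALSE FALSE

# Ungrouped: the whole of "gra" OR the whole of "ey"
ungrouped <- str_detect(spellings, "gra|ey")
ungrouped   # TRUE TRUE TRUE TRUE
```

Without the parentheses, the | splits the entire pattern in two, which is why "graph" and "key" also match.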

Referencing groups

We can then reference groups in order with escaped numbers (\\1) to specify that certain groupings repeat.

# Searches for fruits where there are two characters (..) that get repeated
# a second time \\1
str_view(fruit, pattern = "(..)\\1")
 [4] │ b<anan>a
[20] │ <coco>nut
[22] │ <cucu>mber
[41] │ <juju>be
[56] │ <papa>ya
[73] │ s<alal> berry

Groups also let us be very precise with extracting strings!

shopping_list <- c("apples x4", 
                   "bag of flour", 
                   "bag of sugar", 
                   "milk x2")

str_extract(shopping_list, "([a-z]+) x([1-9])")
[1] "apples x4" NA          NA          "milk x2"  
str_extract(shopping_list, "([a-z]+) x([1-9])", group = 1)
[1] "apples" NA       NA       "milk"  
str_extract(shopping_list, "([a-z]+) x([1-9])", group = 2)
[1] "4" NA  NA  "2"
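
Two other stringr functions work with groups as well. As a rough sketch using the same shopping list: str_match() returns every group at once, and str_replace() can reuse groups in the replacement through \\1 and \\2.

```r
library(stringr)

shopping_list <- c("apples x4", 
                   "bag of flour", 
                   "bag of sugar", 
                   "milk x2")
item_pattern <- "([a-z]+) x([1-9])"

# Character matrix: column 1 is the full match, columns 2 and 3 are the groups
# (rows without a match are filled with NA)
pieces <- str_match(shopping_list, item_pattern)
pieces

# Swap the order of the groups; strings without a match are left unchanged
reordered <- str_replace(shopping_list, item_pattern, "\\2 \\1")
reordered   # "4 apples" "bag of flour" "bag of sugar" "2 milk"
```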