Using `stringr` to Work with Strings

Tuesday, April 29

Today we will…

Different schedule this week & next
New material
- String variables
- Functions for working with strings
- Regular expressions
PA 5.1: Scrambled Message

Follow along

Remember to download, save, and open up the starter notes for this week!

Week 5 Layout

Today: Strings with stringr
- Practice Activity: Decoding a Message

Thursday: Dates with lubridate
- Practice Activity: Jewel Heist
- Discuss midterm exam and project

Lab Assignment: Solving a Murder Mystery
- Using dplyr + stringr + lubridate
- Due (next) Monday
- May only use 1 late day so that I can post solutions before the exam

Week 6 Layout

Tuesday: Version control with git
- Practice Activity - done in groups in class
- Last day to submit Lab 5

Thursday: Midterm Exam

String Variables

What is a string?

A string is a bunch of characters.

There is a difference between…

…a string (many characters, one object)…

and

…a character vector (vector of strings).

my_string <- "Hi, my name is Bond!"
my_string

[1] "Hi, my name is Bond!"

my_vector <- c("Hi", "my", "name", "is", "Bond")
my_vector

[1] "Hi"   "my"   "name" "is"   "Bond"

`stringr`

Common tasks

Identify strings containing a particular pattern.
Remove or replace a pattern.
Edit a string (e.g., make it lowercase).

Note

The stringr package loads with tidyverse.
All functions are of the form str_xxx().

`pattern =`

The pattern argument appears in many stringr functions.

The pattern must be supplied inside quotes.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")

str_detect(my_vector, pattern = "Bond")
str_locate(my_vector, pattern = "James Bond")
str_match(my_vector, pattern = "[bB]ond")
str_extract(my_vector, pattern = "[jJ]ames [bB]ond")

Let’s explore these functions!

`str_detect()`

Returns a logical vector indicating whether the pattern was found in each element of the supplied vector.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")
str_detect(my_vector, pattern = "Bond")

[1] FALSE FALSE  TRUE  TRUE

Pairs well with filter().
Works with summarise() + sum() (to count the matches) or mean() (to get the proportion of matches).

Related Function

str_which() returns the indexes of the strings that contain a match.
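Here is a small sketch of the summaries mentioned above, using the same my_vector. The names has_bond, n_bond, prop_bond, and bond_index are made up for illustration.

```r
library(stringr)

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")

# Logical vector: does each element contain "Bond"?
has_bond <- str_detect(my_vector, pattern = "Bond")

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches...
n_bond <- sum(has_bond)       # 2
# ...and mean() gives the proportion of elements that match
prop_bond <- mean(has_bond)   # 0.5

# str_which() returns the positions of the matching elements
bond_index <- str_which(my_vector, pattern = "Bond")  # 3 4
```

Inside summarise(), the same sum() and mean() calls work on a column of a data frame.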

`str_match()`

Returns a character matrix containing either NA or the pattern, depending on if the pattern was found.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")

str_match(my_vector, pattern = "Bond")

     [,1]  
[1,] NA    
[2,] NA    
[3,] "Bond"
[4,] "Bond"

`str_extract()`

Returns a character vector with either NA or the pattern, depending on if the pattern was found.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")

str_extract(my_vector, pattern = "Bond")

[1] NA     NA     "Bond" "Bond"

Warning

str_extract() returns only the first match in each string.

Use str_extract_all() to return every match in each string.

What do you mean by the first match?

Suppose we had a slightly different vector…

alt_vector <- c("Hello,", 
               "my name is", 
               "Bond, James Bond")

If we were to extract every instance of "Bond" from the vector…

str_extract(alt_vector, 
            pattern = "Bond")

[1] NA     NA     "Bond"

str_extract_all(alt_vector, 
                pattern = "Bond")

[[1]]
character(0)

[[2]]
character(0)

[[3]]
[1] "Bond" "Bond"

`str_locate()`

Returns an integer matrix with two columns: the starting and ending positions of the first match in each string. The values are NA if the pattern is not found.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")

str_locate(my_vector, pattern = "Bond")

     start end
[1,]    NA  NA
[2,]    NA  NA
[3,]     1   4
[4,]     7  10

Related Function

str_sub() extracts values based on a starting and ending location.
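A minimal sketch of str_sub(), assuming a single string named agent (a made-up name). It shows positive positions, negative positions, and positions taken from str_locate().

```r
library(stringr)

agent <- "James Bond"

# Characters 1 through 5
first_name <- str_sub(agent, start = 1, end = 5)     # "James"

# Negative positions count backwards from the end of the string
last_name <- str_sub(agent, start = -4, end = -1)    # "Bond"

# The start and end columns from str_locate() can be handed to str_sub()
loc <- str_locate(agent, pattern = "Bond")
located <- str_sub(agent, start = loc[, "start"], end = loc[, "end"])  # "Bond"
```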

`str_subset()`

Returns a character vector containing only the elements of the original vector in which the pattern was found.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")

str_subset(my_vector, pattern = "Bond")

[1] "Bond"       "James Bond"

Try it out!

my_vector <- c("I scream,", 
               "you scream", 
               "we all",
               "scream",
               "for",
               "ice cream")

str_detect(my_vector, pattern = "cream")
str_locate(my_vector, pattern = "cream")
str_match(my_vector, pattern = "cream")
str_extract(my_vector, pattern = "cream")
str_subset(my_vector, pattern = "cream")

Note

For each of these functions, write down:

the object structure of the output.
the data type of the output.
a brief explanation of what they do.

Replace / Remove Patterns

str_replace() replaces the first matched pattern in each string.

Pairs well with mutate().

str_replace(my_vector, 
            pattern = "Bond", 
            replacement = "Franco")

[1] "Hello,"       "my name is"   "Franco"       "James Franco"

Related Function

str_replace_all() replaces all matched patterns in each string.
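The difference only shows up when a string contains the pattern more than once. Here is a quick sketch reusing the alt_vector from earlier; first_only and every_match are illustrative names.

```r
library(stringr)

alt_vector <- c("Hello,", "my name is", "Bond, James Bond")

# Only the first "Bond" in each string is replaced
first_only <- str_replace(alt_vector,
                          pattern = "Bond",
                          replacement = "Franco")
# "Hello,"  "my name is"  "Franco, James Bond"

# Every "Bond" in each string is replaced
every_match <- str_replace_all(alt_vector,
                               pattern = "Bond",
                               replacement = "Franco")
# "Hello,"  "my name is"  "Franco, James Franco"
```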

str_remove() removes the first matched pattern in each string.

str_remove(my_vector, 
           pattern = "Bond")

[1] "Hello,"     "my name is" ""           "James "

Related Functions

str_remove(x, pattern) is shorthand for str_replace(x, pattern, replacement = "").

str_remove_all() removes all matched patterns in each string.
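A short sketch of the two removal functions side by side, using a made-up string called intro.

```r
library(stringr)

intro <- "Bond, James Bond"

# Removes only the first "Bond"
removed_first <- str_remove(intro, pattern = "Bond")      # ", James Bond"

# Removes every "Bond"
removed_all <- str_remove_all(intro, pattern = "Bond")    # ", James "

# Same result as replacing every match with an empty string
same_as_replace <- str_replace_all(intro, pattern = "Bond", replacement = "")
```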

Edit Strings

Convert letters in a string to a specific capitalization format.

lower
UPPER
Title

str_to_lower() converts all letters in a string to lowercase.

str_to_lower(my_vector)

[1] "hello,"     "my name is" "bond"       "james bond"

str_to_upper() converts all letters in a string to uppercase.

str_to_upper(my_vector)

[1] "HELLO,"     "MY NAME IS" "BOND"       "JAMES BOND"

str_to_title() converts the first letter of each word to uppercase.

str_to_title(my_vector)

[1] "Hello,"     "My Name Is" "Bond"       "James Bond"

This is handy for axis labels!

Combine Strings

str_c()
str_flatten()
str_glue()

str_c() joins multiple strings into a single character vector.

prompt <- "Hello, my name is"
first  <- "James"
last   <- "Bond"
str_c(prompt, last, ",", first, last, sep = " ")

[1] "Hello, my name is Bond , James Bond"

Note

Similar to paste() and paste0().
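One difference worth knowing is how missing values are handled. The sketch below uses a made-up middle name that is NA.

```r
library(stringr)

first  <- "James"
middle <- NA

# str_c() treats NA as contagious: the whole result is NA
stringr_result <- str_c(first, middle, sep = " ")    # NA

# paste() turns NA into the text "NA"
base_result <- paste(first, middle, sep = " ")       # "James NA"
```

The stringr behavior makes missing data harder to overlook, since an NA input never quietly becomes part of a label.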

str_flatten() combines a vector of strings into a single string.

my_vector <- c("Hello,", 
               "my name is", 
               "Bond", 
               "James Bond")

str_flatten(my_vector, collapse = " ")

[1] "Hello, my name is Bond James Bond"

str_glue() uses variables in the environment to create a string based on {expressions}.

first <- "James"
last <- "Bond"
str_glue("My name is {last}, {first} {last}")

My name is Bond, James Bond

Tip

For more details, I would recommend looking up the glue R package!
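The braces can hold any R expression, not just a variable name. This small sketch calls other stringr functions inside them; greeting is an illustrative name.

```r
library(stringr)

first <- "James"
last  <- "Bond"

# Each {expression} is evaluated and its result is pasted into the string
greeting <- str_glue(
  "{str_to_upper(last)}, {first}: {str_length(first) + str_length(last)} letters"
)
greeting
# BOND, James: 9 letters
```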

Tips for String Success

Refer to the stringr cheatsheet
Remember that str_xxx functions need the first argument to be a vector of strings, not a dataset!
- You will use these functions inside dplyr verbs like filter() or mutate().

cereal |> 
  mutate(is_bran = str_detect(name, "Bran"), 
         .after = name)

name	is_bran	manuf	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
100% Bran	TRUE	N	cold	70	4	1	130	10.0	5.0	6	280	25	3	1.00	0.33	68.40297
100% Natural Bran	TRUE	Q	cold	120	3	5	15	2.0	8.0	8	135	0	3	1.00	1.00	33.98368
All-Bran	TRUE	K	cold	70	4	1	260	9.0	7.0	5	320	25	3	1.00	0.33	59.42551
All-Bran with Extra Fiber	TRUE	K	cold	50	4	0	140	14.0	8.0	0	330	25	3	1.00	0.50	93.70491
Almond Delight	FALSE	R	cold	110	2	2	200	1.0	14.0	8	-1	25	3	1.00	0.75	34.38484
Apple Cinnamon Cheerios	FALSE	G	cold	110	2	2	180	1.5	10.5	10	70	25	1	1.00	0.75	29.50954
Apple Jacks	FALSE	K	cold	110	2	0	125	1.0	11.0	14	30	25	2	1.00	1.00	33.17409
Basic 4	FALSE	G	cold	130	3	2	210	2.0	18.0	8	100	25	3	1.33	0.75	37.03856
Bran Chex	TRUE	R	cold	90	2	1	200	4.0	15.0	6	125	25	1	1.00	0.67	49.12025
Bran Flakes	TRUE	P	cold	90	3	0	210	5.0	13.0	5	190	25	3	1.00	0.67	53.31381
Cap'n'Crunch	FALSE	Q	cold	120	1	2	220	0.0	12.0	12	35	25	2	1.00	0.75	18.04285
Cheerios	FALSE	G	cold	110	6	2	290	2.0	17.0	1	105	25	1	1.00	1.25	50.76500
Cinnamon Toast Crunch	FALSE	G	cold	120	1	3	210	0.0	13.0	9	45	25	2	1.00	0.75	19.82357
Clusters	FALSE	G	cold	110	3	2	140	2.0	13.0	7	105	25	3	1.00	0.50	40.40021
Cocoa Puffs	FALSE	G	cold	110	1	1	180	0.0	12.0	13	55	25	2	1.00	1.00	22.73645
Corn Chex	FALSE	R	cold	110	2	0	280	0.0	22.0	3	25	25	1	1.00	1.00	41.44502
Corn Flakes	FALSE	K	cold	100	2	0	290	1.0	21.0	2	35	25	1	1.00	1.00	45.86332
Corn Pops	FALSE	K	cold	110	1	0	90	1.0	13.0	12	20	25	2	1.00	1.00	35.78279
Count Chocula	FALSE	G	cold	110	1	1	180	0.0	12.0	13	65	25	2	1.00	1.00	22.39651
Cracklin' Oat Bran	TRUE	K	cold	110	3	3	140	4.0	10.0	7	160	25	3	1.00	0.50	40.44877
Cream of Wheat (Quick)	FALSE	N	hot	100	3	0	80	1.0	21.0	0	-1	0	2	1.00	1.00	64.53382
Crispix	FALSE	K	cold	110	2	0	220	1.0	21.0	3	30	25	3	1.00	1.00	46.89564
Crispy Wheat & Raisins	FALSE	G	cold	100	2	1	140	2.0	11.0	10	120	25	3	1.00	0.75	36.17620
Double Chex	FALSE	R	cold	100	2	0	190	1.0	18.0	5	80	25	3	1.00	0.75	44.33086
Froot Loops	FALSE	K	cold	110	2	1	125	1.0	11.0	13	30	25	2	1.00	1.00	32.20758
Frosted Flakes	FALSE	K	cold	110	1	0	200	1.0	14.0	11	25	25	1	1.00	0.75	31.43597
Frosted Mini-Wheats	FALSE	K	cold	100	3	0	0	3.0	14.0	7	100	25	2	1.00	0.80	58.34514
Fruit & Fibre Dates; Walnuts; and Oats	FALSE	P	cold	120	3	2	160	5.0	12.0	10	200	25	3	1.25	0.67	40.91705
Fruitful Bran	TRUE	K	cold	120	3	0	240	5.0	14.0	12	190	25	3	1.33	0.67	41.01549
Fruity Pebbles	FALSE	P	cold	110	1	1	135	0.0	13.0	12	25	25	2	1.00	0.75	28.02576
Golden Crisp	FALSE	P	cold	100	2	0	45	0.0	11.0	15	40	25	1	1.00	0.88	35.25244
Golden Grahams	FALSE	G	cold	110	1	1	280	0.0	15.0	9	45	25	2	1.00	0.75	23.80404
Grape Nuts Flakes	FALSE	P	cold	100	3	1	140	3.0	15.0	5	85	25	3	1.00	0.88	52.07690
Grape-Nuts	FALSE	P	cold	110	3	0	170	3.0	17.0	3	90	25	3	1.00	0.25	53.37101
Great Grains Pecan	FALSE	P	cold	120	3	3	75	3.0	13.0	4	100	25	3	1.00	0.33	45.81172
Honey Graham Ohs	FALSE	Q	cold	120	1	2	220	1.0	12.0	11	45	25	2	1.00	1.00	21.87129
Honey Nut Cheerios	FALSE	G	cold	110	3	1	250	1.5	11.5	10	90	25	1	1.00	0.75	31.07222
Honey-comb	FALSE	P	cold	110	1	0	180	0.0	14.0	11	35	25	1	1.00	1.33	28.74241
Just Right Crunchy Nuggets	FALSE	K	cold	110	2	1	170	1.0	17.0	6	60	100	3	1.00	1.00	36.52368
Just Right Fruit & Nut	FALSE	K	cold	140	3	1	170	2.0	20.0	9	95	100	3	1.30	0.75	36.47151
Kix	FALSE	G	cold	110	2	1	260	0.0	21.0	3	40	25	2	1.00	1.50	39.24111
Life	FALSE	Q	cold	100	4	2	150	2.0	12.0	6	95	25	2	1.00	0.67	45.32807
Lucky Charms	FALSE	G	cold	110	2	1	180	0.0	12.0	12	55	25	2	1.00	1.00	26.73451
Maypo	FALSE	A	hot	100	4	1	0	0.0	16.0	3	95	25	2	1.00	1.00	54.85092
Muesli Raisins; Dates; & Almonds	FALSE	R	cold	150	4	3	95	3.0	16.0	11	170	25	3	1.00	1.00	37.13686
Muesli Raisins; Peaches; & Pecans	FALSE	R	cold	150	4	3	150	3.0	16.0	11	170	25	3	1.00	1.00	34.13976
Mueslix Crispy Blend	FALSE	K	cold	160	3	2	150	3.0	17.0	13	160	25	3	1.50	0.67	30.31335
Multi-Grain Cheerios	FALSE	G	cold	100	2	1	220	2.0	15.0	6	90	25	1	1.00	1.00	40.10596
Nut&Honey Crunch	FALSE	K	cold	120	2	1	190	0.0	15.0	9	40	25	2	1.00	0.67	29.92429
Nutri-Grain Almond-Raisin	FALSE	K	cold	140	3	2	220	3.0	21.0	7	130	25	3	1.33	0.67	40.69232
Nutri-grain Wheat	FALSE	K	cold	90	3	0	170	3.0	18.0	2	90	25	3	1.00	1.00	59.64284
Oatmeal Raisin Crisp	FALSE	G	cold	130	3	2	170	1.5	13.5	10	120	25	3	1.25	0.50	30.45084
Post Nat. Raisin Bran	TRUE	P	cold	120	3	1	200	6.0	11.0	14	260	25	3	1.33	0.67	37.84059
Product 19	FALSE	K	cold	100	3	0	320	1.0	20.0	3	45	100	3	1.00	1.00	41.50354
Puffed Rice	FALSE	Q	cold	50	1	0	0	0.0	13.0	0	15	0	3	0.50	1.00	60.75611
Puffed Wheat	FALSE	Q	cold	50	2	0	0	1.0	10.0	0	50	0	3	0.50	1.00	63.00565
Quaker Oat Squares	FALSE	Q	cold	100	4	1	135	2.0	14.0	6	110	25	3	1.00	0.50	49.51187
Quaker Oatmeal	FALSE	Q	hot	100	5	2	0	2.7	-1.0	-1	110	0	1	1.00	0.67	50.82839
Raisin Bran	TRUE	K	cold	120	3	1	210	5.0	14.0	12	240	25	2	1.33	0.75	39.25920
Raisin Nut Bran	TRUE	G	cold	100	3	2	140	2.5	10.5	8	140	25	3	1.00	0.50	39.70340
Raisin Squares	FALSE	K	cold	90	2	0	0	2.0	15.0	6	110	25	3	1.00	0.50	55.33314
Rice Chex	FALSE	R	cold	110	1	0	240	0.0	23.0	2	30	25	1	1.00	1.13	41.99893
Rice Krispies	FALSE	K	cold	110	2	0	290	0.0	22.0	3	35	25	1	1.00	1.00	40.56016
Shredded Wheat	FALSE	N	cold	80	2	0	0	3.0	16.0	0	95	0	1	0.83	1.00	68.23588
Shredded Wheat 'n'Bran	TRUE	N	cold	90	3	0	0	4.0	19.0	0	140	0	1	1.00	0.67	74.47295
Shredded Wheat spoon size	FALSE	N	cold	90	3	0	0	3.0	20.0	0	120	0	1	1.00	0.67	72.80179
Smacks	FALSE	K	cold	110	2	1	70	1.0	9.0	15	40	25	2	1.00	0.75	31.23005
Special K	FALSE	K	cold	110	6	0	230	1.0	16.0	3	55	25	1	1.00	1.00	53.13132
Strawberry Fruit Wheats	FALSE	N	cold	90	2	0	15	3.0	15.0	5	90	25	2	1.00	1.00	59.36399
Total Corn Flakes	FALSE	G	cold	110	2	1	200	0.0	21.0	3	35	100	3	1.00	1.00	38.83975
Total Raisin Bran	TRUE	G	cold	140	3	1	190	4.0	15.0	14	230	100	3	1.50	1.00	28.59278
Total Whole Grain	FALSE	G	cold	100	3	1	200	3.0	16.0	3	110	100	3	1.00	1.00	46.65884
Triples	FALSE	G	cold	110	2	1	250	0.0	21.0	3	60	25	3	1.00	0.75	39.10617
Trix	FALSE	G	cold	110	1	1	140	0.0	13.0	12	25	25	2	1.00	1.00	27.75330
Wheat Chex	FALSE	R	cold	100	3	1	230	3.0	17.0	3	115	25	1	1.00	0.67	49.78744
Wheaties	FALSE	G	cold	100	3	1	200	3.0	17.0	3	110	25	1	1.00	1.00	51.59219
Wheaties Honey Gold	FALSE	G	cold	110	2	1	200	1.0	16.0	8	60	25	1	1.00	0.75	36.18756

Tips for String Success

The real power of these str_xxx functions comes when you specify the pattern using regular expressions!

regex

Regular Expressions

“Regexps are a very terse language that allow you to describe patterns in strings.”

R for Data Science

Use str_xxx functions + regular expressions!

str_detect(string  = my_string_vector,
           pattern = "p[ei]ck[a-z]")

Tip

You might encounter gsub(), grep(), etc. from Base R, but I highly recommend using functions from the stringr package instead.

Regular Expressions

…are tricky!

There are lots of new symbols to keep straight.
There are a lot of cases to think through.

This web app for testing R regular expressions might be handy!

Special Characters

There is a set of characters that have a specific meaning when using regex.

The stringr package does not read these as normal characters.
These characters are:

. ^ $ \ | * + ? { } [ ] ( )

Wild Card Character: `.`

The . character matches any single character (by default, anything except a newline).

x <- c("She", 
       "sells", 
       "seashells", 
       "by", 
       "the", 
       "seashore!")

str_subset(x, pattern = ".ells")

[1] "sells"     "seashells"

str_extract(x, pattern = ".ells")

[1] NA      "sells" "hells" NA      NA      NA

This matches strings that contain any character followed by “ells”.

Anchor Characters: `^ $`

^ – looks at the beginning of a string.

x <- c("She", 
       "sells", 
       "seashells", 
       "by", 
       "the", 
       "seashore!")

str_subset(x, pattern = "^s")

[1] "sells"     "seashells" "seashore!"

This matches strings that start with “s”.

$ – looks at the end of a string.

x <- c("She", 
       "sells", 
       "seashells", 
       "by", 
       "the", 
       "seashore!")

str_subset(x, pattern = "s$")

[1] "sells"     "seashells"

This matches strings that end with “s”.
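The two anchors can be combined to require that the whole string matches. Here is a sketch with a made-up vector called words.

```r
library(stringr)

words <- c("the", "then", "bathe")

# No anchors: "the" may appear anywhere in the string
anywhere <- str_detect(words, pattern = "the")      # TRUE  TRUE  TRUE

# ^ only: the string must start with "the"
at_start <- str_detect(words, pattern = "^the")     # TRUE  TRUE  FALSE

# $ only: the string must end with "the"
at_end <- str_detect(words, pattern = "the$")       # TRUE  FALSE TRUE

# Both: the entire string must be exactly "the"
exact <- str_detect(words, pattern = "^the$")       # TRUE  FALSE FALSE
```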

Quantifier Characters: `? + *`

? – matches when the preceding character occurs 0 or 1 times in a row.

x <- c("shes", 
       "shels", 
       "shells", 
       "shellls", 
       "shelllls")

str_subset(x, pattern = "shel?s")

[1] "shes"  "shels"

+ – occurs 1 or more times in a row.

str_subset(x, pattern = "shel+s")

[1] "shels"    "shells"   "shellls"  "shelllls"

* – occurs 0 or more times in a row.

str_subset(x, pattern = "shel*s")

[1] "shes"     "shels"    "shells"   "shellls"  "shelllls"
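A common use of ? is to allow for spelling variants. This sketch uses a made-up vector called spellings.

```r
library(stringr)

spellings <- c("color", "colour", "colouur")

# "u" is optional: zero or one "u"
optional_u <- str_subset(spellings, pattern = "colou?r")    # "color" "colour"

# At least one "u"
one_or_more_u <- str_subset(spellings, pattern = "colou+r") # "colour" "colouur"

# Any number of "u", including none
any_u <- str_subset(spellings, pattern = "colou*r")         # all three
```

In each pattern the quantifier applies only to the single character right before it, here the "u".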

Quantifier Characters: `{}`

{n} – matches when the preceding character occurs exactly n times in a row.

x <- c("shes", 
       "shels", 
       "shells", 
       "shellls", 
       "shelllls")

str_subset(x, pattern = "shel{2}s")

[1] "shells"

{n,} – occurs at least n times in a row.

str_subset(x, pattern = "shel{2,}s")

[1] "shells"   "shellls"  "shelllls"

{n,m} – occurs between n and m times in a row.

str_subset(x, pattern = "shel{1,3}s")

[1] "shels"   "shells"  "shellls"

Character Classes: `[]`

Character classes let you specify multiple possible characters to match on.

x <- c("Peter", 
       "Piper", 
       "picked", 
       "a",
       "peck",
       "of",
       "pickled",
       "peppers!")

str_subset(x, pattern = "p[ei]ck")

[1] "picked"  "peck"    "pickled"

Matches you don’t want

[^ ] – specifies characters not to match on (think except)

str_subset(x, pattern = "p[^i]ck")

[1] "peck"

But remember that ^ outside of brackets anchors the match to the first character in a string.

str_subset(x, pattern = "^p")

[1] "picked"   "peck"     "pickled"  "peppers!"

str_subset(x, pattern = "^[^p]")

[1] "Peter" "Piper" "a"     "of"

Warning

Why do “Peter” and “Piper” match "^[^p]"?

Capitalization matters! The pattern only rules out a lowercase “p” as the first character.

Character Classes: `[]`

[ - ] – specifies a range of characters.

x <- c("Peter", 
       "Piper", 
       "picked", 
       "a",
       "peck",
       "of",
       "pickled",
       "peppers!")

str_subset(x, pattern = "p[ei]ck[a-z]")

[1] "picked"  "pickled"

[A-Z] matches any capital letter.
[a-z] matches any lowercase letter.
[A-Za-z] or [:alpha:] matches any letter. (Avoid [A-z]: that range also covers the characters [ \ ] ^ _ ` that sit between Z and a.)
[0-9] or [:digit:] matches any digit.
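Ranges follow character-code order, so a range that spans upper and lower case picks up a few symbols too. Here is a small check with a made-up vector called chars.

```r
library(stringr)

chars <- c("a", "Z", "_", "^", "5")

# [A-z] covers everything from "A" to "z" in code order,
# which includes the symbols [ \ ] ^ _ ` between "Z" and "a"
wide_range <- str_detect(chars, pattern = "[A-z]")
# TRUE  TRUE  TRUE  TRUE  FALSE

# Two separate ranges match letters only
letters_only <- str_detect(chars, pattern = "[A-Za-z]")
# TRUE  TRUE  FALSE FALSE FALSE
```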

Shortcuts

\\w – matches any “word” character (\\W matches any non-“word” character)
- A “word” character is a letter, a digit, or an underscore.
\\d – matches any digit (\\D matches not digit)
\\s – matches any whitespace (\\S matches not whitespace)
- Whitespace includes spaces, tabs, newlines, etc.

x <- "phone number: 1234567899"

str_extract(x, pattern = "\\d+")

[1] "1234567899"

str_extract_all(x, pattern = "\\S+")

[[1]]
[1] "phone"      "number:"    "1234567899"
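A sketch of two more shortcuts in action. The phone number and the name snake_case! are made up for illustration.

```r
library(stringr)

x <- "phone number: 1234567899"

# \\w+ grabs runs of word characters, so the colon is left out
words <- str_extract_all(x, pattern = "\\w+")[[1]]
# "phone" "number" "1234567899"

# The underscore counts as a word character; the "!" does not
identifier <- str_extract("snake_case!", pattern = "\\w+")   # "snake_case"

# \\D matches anything that is not a digit, so removing it keeps digits only
digits_only <- str_remove_all("(515)242-1958", pattern = "\\D")  # "5152421958"
```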

Character Groups: `()`

Groups are created with ( ).

We can specify “either” / “or” within a group using |.

x <- c("Peter", 
       "Piper", 
       "picked", 
       "a", 
       "peck",
       "of", 
       "pickled",
       "peppers!")

str_subset(x, pattern = "p(e|i)ck")

[1] "picked"  "peck"    "pickled"

This matches strings that contain either “peck” or “pick”.

Character Groups: `()`

We can then reference groups in order with escaped numbers (\\1) to specify that certain groupings repeat.

x <- c("hannah", 
       "had", 
       "a", 
       "ball", 
       "on",
       "a", 
       "race car")

str_subset(x, pattern = "^(.).*\\1$")

[1] "hannah"   "race car"

This matches strings that start and end with the same character.
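A group can hold more than one character, and the backreference then repeats the whole group. This sketch uses a made-up vector called fruit.

```r
library(stringr)

fruit <- c("banana", "coconut", "apple")

# (..) captures any two characters; \\1 requires the same two right after
repeated_pair <- str_subset(fruit, pattern = "(..)\\1")
# "banana" ("anan") and "coconut" ("coco")

# (.) captures one character; \\1 requires it to repeat immediately
doubled_letter <- str_subset(fruit, pattern = "(.)\\1")
# "apple" ("pp")
```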

Character Groups: `()`

Groups also let us be very precise with extracting strings!

shopping_list <- c("apples x4", 
                   "bag of flour", 
                   "bag of sugar", 
                   "milk x2")

str_extract(shopping_list, "([a-z]+) x([1-9])")

[1] "apples x4" NA          NA          "milk x2"

str_extract(shopping_list, "([a-z]+) x([1-9])", group = 1)

[1] "apples" NA       NA       "milk"

str_extract(shopping_list, "([a-z]+) x([1-9])", group = 2)

[1] "4" NA  NA  "2"
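If you want every group at once, str_match() returns them together as columns of a matrix. This is a sketch with the same shopping_list; pieces, item, and quantity are illustrative names. As far as I know, the group argument of str_extract() needs stringr 1.5.0 or later, so str_match() is also a fallback on older versions.

```r
library(stringr)

shopping_list <- c("apples x4",
                   "bag of flour",
                   "bag of sugar",
                   "milk x2")

# Column 1 is the full match; columns 2 and 3 are groups 1 and 2
pieces <- str_match(shopping_list, pattern = "([a-z]+) x([1-9])")

item     <- pieces[, 2]               # "apples" NA NA "milk"
quantity <- as.integer(pieces[, 3])   # 4 NA NA 2
```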

Try it out!

What regular expressions would match words that…

end with a vowel?
start with x, y, or z?
contain at least one digit?
contain two of the same letter in a row?

x <- c("zebra", 
       "xray", 
       "apple", 
       "yellow",
       "color", 
       "patt3rn",
       "g2g",
       "summarise")

Some Possible Solutions…

end with a vowel?

str_subset(x, "[aeiouy]$")

start with x, y, or z?

str_subset(x, "^[xyz]")

contain at least one digit?

str_subset(x, "[:digit:]")

contain two of the same letter in a row?

str_subset(x, "([:alpha:])\\1")

Escape: `\\`

To match a special character, you need to escape it.

x <- c("How",
       "much", 
       "wood",
       "could",
       "a",
       "woodchuck",
       "chuck",
       "if",
       "a",
       "woodchuck",
       "could",
       "chuck",
       "wood?")

str_subset(x, pattern = "?")

Error in stri_subset_regex(string, pattern, omit_na = TRUE, negate = negate, : Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=`?`)

Escape: `\\`

Use \\ to escape the ? – it is now read as a normal character.

str_subset(x, pattern = "\\?")

[1] "wood?"

Note

Alternatively, you could use []:

str_subset(x, pattern = "[?]")

[1] "wood?"
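A third option is to tell stringr that the pattern is plain text rather than a regular expression, using fixed(). Here is a sketch with a shortened, made-up version of the vector.

```r
library(stringr)

x <- c("wood", "wood?", "could")

# fixed() turns off regex: "?" is just a question mark
literal_match <- str_subset(x, pattern = fixed("?"))   # "wood?"

# The same idea with the wild card character
dot_as_regex   <- str_detect("314", pattern = ".")          # TRUE: any character
dot_escaped    <- str_detect("314", pattern = "\\.")        # FALSE: no literal "."
dot_as_literal <- str_detect("314", pattern = fixed("."))   # FALSE: no literal "."
```

Why two backslashes? R uses one backslash for its own escapes inside strings, so "\\?" is the R way of writing the regex \?.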

When in Doubt

Use the web app to test R regular expressions.

Tips for working with regex

Read the regular expressions out loud like a request.

Test out your expressions on small examples first.

str_view()

str_view(c("shes", "shels", "shells", "shellls", "shelllls"), "l+")

[2] │ she<l>s
[3] │ she<ll>s
[4] │ she<lll>s
[5] │ she<llll>s

Use the stringr cheatsheet.

Be kind to yourself!

More practice!

I want to join two datasets that have a county variable:

county_pop

county	pop
STORY	100000
BOONE	40000
MARSHALL	120000
POLK	500000

county_loc

county	region
Story	Central
Boone	Central
Marshall	East
Polk	Central

Practice

What stringr function will help me join the county_pop and county_loc by county?
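One possible approach, sketched under the assumption that the two tables look exactly like the ones shown: change the capitalization in one table so the county values line up, then join. The data frames are rebuilt by hand here, and county_joined is a made-up name.

```r
library(stringr)
library(dplyr)

county_pop <- data.frame(
  county = c("STORY", "BOONE", "MARSHALL", "POLK"),
  pop    = c(100000, 40000, 120000, 500000)
)

county_loc <- data.frame(
  county = c("Story", "Boone", "Marshall", "Polk"),
  region = c("Central", "Central", "East", "Central")
)

# Convert "STORY" to "Story" so the keys match, then join on county
county_joined <- county_pop |>
  mutate(county = str_to_title(county)) |>
  left_join(county_loc, by = "county")
```

Converting both county columns with str_to_lower() or str_to_upper() would work just as well; what matters is that both tables use the same capitalization.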

More practice!

What if I want to pull out only the area code in a phone number?

phone_numbers <- c("(515)242-1958", "(507)598-1395", "(805)938-7639")

Practice

You will need a stringr function and to use regular expressions!

str_extract(phone_numbers, "\\(\\d{3}\\)")

[1] "(515)" "(507)" "(805)"

What if I want just the numbers in the area code?

str_extract(phone_numbers, "\\((\\d{3})\\)", group = 1)

[1] "515" "507" "805"

phone_numbers |> 
  str_extract(pattern = "\\(\\d{3}\\)") |> 
  str_remove_all(pattern = "[:punct:]")

[1] "515" "507" "805"

More practice! (last one)

awards_dat

awards
Beyonce: 35G, 0A, 0E
Kendrick Lamar: 22G, 0A, 1E
Charli XCX: 2G, 0A, 0E
Cynthia Erivo: 1G, 0A, 1E
Viola Davis: 1G, 1A, 1E
Elton John: 6G, 2A, 1E

That’s annoying…

Create a variable with just the artist name and a variable with the number of Grammys won.

awards_dat |> 
  mutate(artist = str_extract(awards, "[A-Za-z\\s]+"),
         grammies = str_extract(awards, "(\\d+)G", 
                                group = 1)) |> 
  select(artist, grammies)

artist	grammies
Beyonce	35
Kendrick Lamar	22
Charli XCX	2
Cynthia Erivo	1
Viola Davis	1
Elton John	6

PA 5.1: Scrambled Message

In this activity, you will use functions from the stringr package and regex to decode a message.

To do…

PA 5.1: Scrambled Message
- Due Thursday before class
LA 5: Murder in SQL City
- Due Monday at 11:59 pm
- You can use maximum 1 late day on this lab!
Look out for exam information posted on Canvas - we will discuss on Thursday
