---
title: "STAT 331 Week 5 Day 1 Handout"
format: html
embed-resources: true
---

```{r}
#| label: setup

library(tidyverse)
colleges <- read_csv("https://www.dropbox.com/s/bt5hvctdevhbq6j/colleges.csv?dl=1")  |> 
  select(INSTNM, CITY, STABBR, ZIP)
```

Open and edit this handout in VISUAL mode to help with editing tables.

1.  What is the difference between a character vector and a string?

2.  What inputs do `str_XXXX()` functions require?

## Regular Expressions

### Summary

Fill in the table of basic regular expressions below.

| Regular Expression | Example | What does this expression mean (in words)? |
|-------------------|-------------------------|-----------------------------|
| `.` | `".ells"` |  |
| `\\` | `"\\."` |  |
| `^` | `"^California State"` |  |
| `$` | `"State University$"` |  |
| `[]` | `"[AC]"` |  |
| `[^]` | `"[^U]"` |  |
| `^[^]` | `"^[^U]"` |  |
| `[A-Z]` | `"[A-Z]"` |  |
| `[0-9]` | `"[0-9]"` |  |
| `[:punct:]` | `"[:punct:]"` |  |
| `+` | `"St\\.+"` |  |
| `*` | `"\\s*-\\s*"` |  |
| `{a}` | `"[A-Z]{4}"` |  |
| `{a,b}` | `"[A-Z]{4,6}"` |  |
| `{a,}` | `"[A-Z]{4,}"` |  |
| `()` | `"(iss){2}"` |  |
| `|` | `"(T|t)ech"` |  |
| `()\\1` | `"([aeiou])\\1"` |  |

### Code for Testing

#### Anchors

```{r}
str_subset(colleges$INSTNM, 
           pattern = "^California State") |> 
  head()
```

```{r}
str_subset(colleges$INSTNM, 
           pattern = "State University$") |> 
  head()
```

#### Classes \[\]

```{r}
str_subset(colleges$INSTNM, 
           pattern = "^[AC]")  |> 
  head()
```

```{r}
str_subset(colleges$INSTNM, 
           pattern = "^[^U]")  |> 
  head()
```

```{r}
str_subset(colleges$INSTNM, 
           pattern = "[0-9]")  |> 
  head()
```

```{r}
str_subset(colleges$INSTNM, 
           pattern = "[:punct:]$")  |> 
  head()
```

#### Quantifiers

```{r}
str_subset(colleges$INSTNM, 
           pattern = "St\\.+")  |> 
  head()
```

```{r}
str_subset(colleges$INSTNM, 
           pattern = "\\s*-\\s*") |> 
  head()
```

```{r}
str_subset(colleges$INSTNM, 
           pattern = "[A-Z]{4}")
```

```{r}
str_subset(colleges$INSTNM, 
           pattern = "[A-Z]{6}") |> 
  head()
```
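The summary table also lists `{a,b}` and `{a,}`, which the chunks above do not exercise. A minimal sketch for comparing the three forms, assuming a small made-up vector (`codes` is hypothetical and not part of the `colleges` data):

```{r}
library(stringr)

# Hypothetical toy vector: runs of capital letters of length 3, 4, 6, and 8
codes <- c("ABC", "ABCD", "ABCDEF", "ABCDEFGH")

# The anchors (^ and $) force the whole string to match, so the run length matters
str_detect(codes, pattern = "^[A-Z]{4}$")    # exactly 4 capitals
str_detect(codes, pattern = "^[A-Z]{4,6}$")  # between 4 and 6 capitals
str_detect(codes, pattern = "^[A-Z]{4,}$")   # 4 or more capitals
```

Without the anchors, all three patterns match any string that contains at least four consecutive capitals, because a longer run still contains a run of four.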

#### Groups ()

```{r}
str_subset(colleges$INSTNM, 
           pattern = "(iss){2}")
```

```{r}
str_subset(colleges$INSTNM, 
           pattern = "(T|t)ech") |> 
  head()
```

```{r}
str_subset(colleges$INSTNM, 
           pattern = "([aeiou])\\1") |> 
  head()
```
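#### Wildcards and Escapes

The first two rows of the summary table (`.` and `\\`) have no test chunk of their own. A minimal sketch for trying them out, assuming a small made-up vector (`words` is hypothetical and not part of the `colleges` data):

```{r}
library(stringr)

# Hypothetical toy vector for comparing "." with "\\."
words <- c("bells", "Wells Fargo", "ells", "St. Mary", "State")

# "." matches any single character (other than a line break),
# so ".ells" needs some character in front of "ells"
str_subset(words, pattern = ".ells")

# "\\." matches a literal period only
str_subset(words, pattern = "St\\.")

# Unescaped, the "." also matches the "a" in "State"
str_subset(words, pattern = "St.")
```

The same patterns can be swapped into `str_subset(colleges$INSTNM, ...)` to see how they behave on the real names.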

## `stringr` functions

### Summary

Fill in the table on `stringr` functions below.

| Function | Example | What does this function do? |
|-------------------|-------------------------|----------------------------|
| `str_subset()` | `str_subset(my_vector, pattern = "Bond")` |  |
| `str_detect(, pattern = ...)` | `str_detect(my_vector, pattern = "Bond")` |  |
| `str_replace(, pattern = ..., replacement = ...)` | `str_replace(my_vector, pattern = "Bond", replacement = "Franco")` |  |
| `str_remove(, pattern = ...)` | `str_remove(my_vector, pattern = "Bond")` |  |
| `str_length()` | `str_length(INSTNM)` |  |
| `str_sub(, start = ..., end = ...)` | `str_sub(INSTNM, start = 1, end = 8)` |  |
| `str_pad(, side = ...)` | `str_pad(INSTNM, width = 20, pad = "_")` |  |
| `str_extract(, pattern = ...)` | `str_extract(INSTNM, pattern = "[A-Za-z]*\\s")` |  |
| `str_to_lower()` | `str_to_lower(INSTNM)` |  |
| `str_to_upper()` | `str_to_upper(INSTNM)` |  |
| `str_to_title()` | `str_to_title(INSTNM)` |  |
| `str_trim()` | `str_trim(INSTNM, side = "both")` |  |
| `str_squish()` | `str_squish(INSTNM)` |  |
| `str_c()` | `str_c(CITY, STABBR, ZIP, sep = ", ")` |  |

### Code for Testing

```{r}
colleges |> 
  filter(str_detect(INSTNM, pattern = "Polytechnic")) 
```

```{r}
colleges |> 
  summarize(n = sum(
    str_detect(INSTNM, pattern = "Polytechnic"))
    )
```
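Both chunks above rely on `str_detect()` returning a logical vector: `filter()` keeps the rows where it is `TRUE`, and `sum()` counts them. A minimal sketch of that behavior, assuming a small illustrative vector (`schools` and `has_poly` are hypothetical names, not part of the handout's data):

```{r}
library(stringr)

# Hypothetical toy vector of institution names
schools <- c("California Polytechnic State University",
             "Rensselaer Polytechnic Institute",
             "Stanford University")

# str_detect() returns one TRUE/FALSE per element of the input
has_poly <- str_detect(schools, pattern = "Polytechnic")
has_poly

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
sum(has_poly)
```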

```{r}
colleges |> 
  mutate(INSTNM = str_replace(INSTNM, 
                              pattern = "University", 
                              replacement = "Uni")
         ) |> 
  slice_sample(n = 10)
```

```{r}
colleges |> 
  mutate(INSTNM = str_remove(INSTNM, 
                             pattern = "(College|University)"
                             )
         ) |> 
  slice_sample(n = 10)
```

```{r}
colleges |> 
  mutate(short_name = str_sub(INSTNM, 
                              start = 1,
                              end = 8)
         ) |> 
  select(INSTNM, short_name) |> 
  head()
```

```{r}
colleges |> 
  mutate(long_name = str_pad(INSTNM, 
                             width = 20, 
                             pad = "_", 
                             side = "both")) |> 
  select(INSTNM, long_name) |> 
  head()
```

```{r}
colleges |> 
  mutate(first_word = str_extract(INSTNM,
                                  pattern = "[A-Za-z]*")) |> 
  select(INSTNM, first_word) |> 
  head()
```
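A caution about letter ranges: a range runs by character code, so `[A-z]` also covers the few non-letter characters that sit between `Z` and `a`. A minimal sketch of the difference, assuming a small made-up vector (`not_letters` is hypothetical):

```{r}
library(stringr)

# Hypothetical test characters: none of these is a letter
not_letters <- c("[", "^", "_")

# The range from "A" to "z" includes the characters between "Z" and "a"
str_detect(not_letters, pattern = "[A-z]")

# Separate uppercase and lowercase ranges match letters only
str_detect(not_letters, pattern = "[A-Za-z]")
```

For institution names the two usually agree, but `[A-Za-z]` is the safer way to say "any letter".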

```{r}
colleges |> 
  mutate(INSTNM = str_to_lower(INSTNM)) |> 
  head()
```

```{r}
colleges |> 
  mutate(INSTNM = str_to_upper(INSTNM)) |> 
  head()
```

```{r}
colleges |> 
  mutate(INSTNM = str_to_title(INSTNM)) |> 
  head()
```

```{r}
colleges |> 
  mutate(INSTNM = str_trim(INSTNM, side = "both")) |> 
  head()
```
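The `colleges` names may not contain much stray whitespace, so the effect of `str_trim()` can be hard to see, and `str_squish()` and `str_length()` from the summary table have no chunk of their own. A minimal sketch on a made-up string (`messy` is hypothetical) that makes the differences visible:

```{r}
library(stringr)

# Hypothetical messy name with extra spaces at both ends and in the middle
messy <- "  Cal   Poly  "

str_trim(messy, side = "both")  # removes leading and trailing whitespace only
str_squish(messy)               # also collapses repeated internal whitespace
str_length(messy)               # counts every character, including the spaces
```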

```{r}
colleges |> 
  mutate(address = str_c(CITY, STABBR, ZIP, sep = ", ")) |> 
  slice_sample(n = 10)
```