PA 3: Identify the Mystery College 🏫

Data Wrangling with dplyr

Today you will use the dplyr package to clean some data. We will then use that cleaned data to figure out what college Margaret has been accepted to.

Download starter .qmd file

Caution

Make sure you move your .qmd file from the Downloads folder into your Stat 331 directory – you should be working in your R project! DO NOT open it straight from download.

Dealing with Errors

As you work through this PA, you will encounter some code that does not work as you want it to. Don’t despair! Errors (when R is unable to run your code) and bugs (when the code doesn’t do what you hoped) are a natural part of coding. Even the best of the best deal with these problems regularly - learning to track down the issue is a skill that you can learn and practice.

Errors can be sneaky - check results often!

If a chunk of code runs smoothly without giving you an error or warning this does not necessarily mean it accomplished the desired task.

It is a good habit to check the results of your code every time you finish a task. In general, I would recommend completing the following tasks every time you write a code chunk.

Include a comment at the beginning of the code chunk that briefly states the purpose of the chunk. Comments in the code chunks come after # signs. These comments will remind later readers—which might be your future self!—what the desired output of the code chunk is.
If you created a new object, take a look at it! You can inspect the object by either clicking its name in your Environment tab or by typing its name into the console. Make sure it looks about how you expect. Do not type code to inspect the object in your Quarto file, as that is not code that needs to be saved!
If you created or updated a data frame, make sure your edits did what you hoped. Use the Environment or the head() function to investigate your changes.

Two heads are better than one!

It can be hard to spot bugs in code that you wrote.

Work with those around you - if something goes wrong, ask a friend or neighbor to take a peek at your code and see if any glaring errors pop out.

Explain your code out loud!

The best way to troubleshoot a sneaky bug is to explain, out loud, each step of your code, and what you hoped to accomplish.

Meet up with classmates to have everyone talk through their code.

If you are alone, try Rubber Duck Debugging!

Google is your friend!

The whole of the internet is at your disposal! Use it early, use it often.

Some tricks:

Copy-paste the exact error message into Google. Chances are, somebody else had a similar problem and got a similar message.
Include package names in your search terms. For example, “bar plot in ggplot” is a better search than “bar plot in R”.

Part One: Data Import and Cleaning

This section will clean today’s data so that you can use it more easily in Part Two.

Data Download & Package Loading

First, we declare our package dependencies and load the data.

Warning

The data loading function read_csv() will give you an outpouring of helpful information about the dataset. If you do not see the word “error”, there is nothing to be concerned about.

library(tidyverse)
colleges <- read_csv("https://www.dropbox.com/s/bt5hvctdevhbq6j/colleges.csv?dl=1")

Take a look at the variables in your downloaded data by running the following code. This code with the str (structure) function reports the data type for each column in the dataset.

str(colleges, give.attr = FALSE)

spc_tbl_ [7,058 × 27] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ...1                         : num [1:7058] 1 2 3 4 5 6 7 8 9 10 ...
 $ INSTNM                       : chr [1:7058] "Alabama A & M University" "University of Alabama at Birmingham" "Amridge University" "University of Alabama in Huntsville" ...
 $ CITY                         : chr [1:7058] "Normal" "Birmingham" "Montgomery" "Huntsville" ...
 $ STABBR                       : chr [1:7058] "AL" "AL" "AL" "AL" ...
 $ ZIP                          : chr [1:7058] "35762" "35294-0110" "36117-3553" "35899" ...
 $ CONTROL                      : num [1:7058] 1 1 2 1 1 1 1 1 1 1 ...
 $ ADM_RATE                     : chr [1:7058] "0.9027" "0.9181" "NULL" "0.8123" ...
 $ SAT_AVG                      : chr [1:7058] "929" "1195" "NULL" "1322" ...
 $ TUITIONFEE_IN                : chr [1:7058] "9857" "8328" "6900" "10280" ...
 $ TUITIONFEE_OUT               : chr [1:7058] "18236" "19032" "6900" "21480" ...
 $ UGDS                         : chr [1:7058] "4824" "12866" "322" "6917" ...
 $ REGION                       : num [1:7058] 5 5 5 5 5 5 5 5 5 5 ...
 $ DEP_INC_PCT_H2               : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
 $ MD_INC_WDRAW_2YR_TRANS_YR4_RT: chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
 $ IND_COMP_4YR_TRANS_YR4_RT    : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
 $ SD_EARN_WNE_P10              : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
 $ FEMALE_WDRAW_4YR_TRANS_YR6_RT: chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
 $ LO_INC_COMP_2YR_TRANS_YR3_RT : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
 $ NOLOAN_COMP_ORIG_YR4_RT      : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
 $ OPENADMP                     : chr [1:7058] "2" "2" "1" "2" ...
 $ PELL_COMP_4YR_TRANS_YR3_RT   : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
 $ DEATH_YR2_RT                 : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
 $ NOLOAN_UNKN_ORIG_YR2_RT      : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
 $ NOT1STGEN_WDRAW_ORIG_YR6_RT  : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
 $ HI_INC_YR8_N                 : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
 $ CUML_DEBT_P90                : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
 $ C100_L4                      : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...

Data Cleaning

Now we will clean the data. Alas, each of the R chunks in this section will cause an error and / or do the desired task incorrectly. Even the chunks that run without error are not correct! You will need to find the mistake and correct it to complete the intended action.

Step 1: There are too many variables in this data set. We don’t need all of them. Narrow your data set down to only:

INSTNM name of the institution
CITY city, STABBR state, and ZIP ZIP code of the institution
ADM_RATE admissions rate
SAT_AVG average SAT score
UGDS number of undergraduate students
TUITIONFEE_IN in- and TUITIONFEE_OUT out-of-state tuition
CONTROL Whether the school is public or private
REGION region of the school.

colleges_clean <- colleges | >  
  select(INSTNM, CITY, STABBR, ZIP,
         ADM_RATE, SAT_AVG, UGDS,
         TUITIONFEE_IN, TUITIONFEE_OUT
         CONTROL, REGION)

Error in parse(text = input): <text>:1:30: unexpected '>'
1: colleges_clean <- colleges | >
                                 ^

Step 2: Remove the schools that are for-profit (category 3), keeping public (category 1) and private schools (category 2).

colleges_clean <- colleges_clean |> 
  filter(CONTROL == 1, CONTROL == 2)

Error: object 'colleges_clean' not found

Step 3: Adjust the appropriate variables to be numeric, using as.numeric().

colleges_clean <- colleges_clean |> 
  mutate(TUITIONFEE_IN  = numeric(TUITIONFEE_IN),
         TUITIONFEE_OUT = numeric(TUITIONFEE_OUT),
         SAT_AVG        = numeric(SAT_AVG),
         UGDS           = numeric(UGDS),
         ADM_RATE       = numeric(ADM_RATE))

Error: object 'colleges_clean' not found

Step 4: Adjust the appropriate variables to be factors, using as.factor().

Note

We will talk more about special data types (including factors) in a few weeks.

colleges_clean <- colleges_clean |>
  mutate(CONTROL = as.character(CONTROL),
         REGION  = as.character(REGION))

Error: object 'colleges_clean' not found

Step 5: Create a new variable called TUITION_DIFF which contains the difference between out-of-state and in-state costs.

colleges_clean |> 
    TUITION_DIFF = TUITIONFEE_OUT - TUITIONFEE_IN

Error in TUITION_DIFF: The pipe operator requires a function call as RHS (<input>:2:5)

Step 6: Create a new variable called TOTAL_IN which contains the total amount of money made from tuition per year. (Note, this is just an approximation by mutliplying the number of undergrads by the in-state tuition)

colleges_clean <- colleges_clean |> 
    select(TOTAL_IN = UGDS x TUITIONFEE_IN)

Error in parse(text = input): <text>:2:28: unexpected symbol
1: colleges_clean <- colleges_clean |> 
2:     select(TOTAL_IN = UGDS x
                              ^

Step 7: Remove every row with missing data.

Warning

This is not always a great idea! Usually, even if some of the information is missing, we don’t want to throw out the entire row. This time, however, we’ll be lazy.

colleges_clean <- colleges_clean |> 
  drop.na()

Error in drop.na(colleges_clean): could not find function "drop.na"

Lastly, notice that each of these steps started with

colleges_clean <- colleges_clean |> ...

That is pretty redundant! Instead, we could perform all these tasks as one long “pipeline.”

Step 8: Combine your (fixed) code chunks into a single code chunk that carries out all of the steps necessary to clean the data and save it as colleges_clean.

Tip

Think about coding efficiency – you should not have multiple calls to the same function!

# Code combining ALL of your previous steps into ONE pipeline

Part Two: Identify the Mystery College

Wow! Your best friend Margaret has been accepted to her top choice college! Unfortunately, Margaret is a very mysterious person, and she won’t tell you directly which college this is. You’ll have to use her clues to figure out which school is her dream school.

You should use the cleaned version of the data (colleges_clean) and these clues to find her dream school!

Clues:

This college is located in Region 4.
This college’s admission rate is above the median rate for the region.
This college charges more for out-of-state tuition than in-state tuition.
More than 10,000 undergraduates attend this college.
The total amount of money made from tuition per year for this college is at least $200 million. (Use the TOTAL_IN variable you created)
The average SAT score of this college is an odd number.
Out of the two remaining colleges, this college has the least expensive tuition. While you could just glance at the result to find this, practice using a dplyr verb to get down to one college.

# Code to solve for the mystery college using the 7 clues above.

Submit the college Margaret will attend to the Canvas Quiz.