library(tidyverse)
<- read_csv("https://www.dropbox.com/s/bt5hvctdevhbq6j/colleges.csv?dl=1") colleges
PA 3: Identify the Mystery College 🏫
Data Wrangling with dplyr
Today you will use the dplyr
package to clean some data. We will then use that cleaned data to figure out what college Margaret has been accepted to.
Make sure you move your .qmd file from the Downloads folder into your Stat 331 directory – you should be working in your R project! DO NOT open it straight from download.
As you work through this PA, you will encounter some code that does not work as you want it to. Don’t despair! Errors (when R is unable to run your code) and bugs (when the code doesn’t do what you hoped) are a natural part of coding. Even the best of the best deal with these problems regularly - learning to track down the issue is a skill that you can learn and practice.
Errors can be sneaky - check results often!
If a chunk of code runs smoothly without giving you an error or warning this does not necessarily mean it accomplished the desired task.
It is a good habit to check the results of your code every time you finish a task. In general, I would recommend completing the following tasks every time you write a code chunk.
Include a comment at the beginning of the code chunk that briefly states the purpose of the chunk. Comments in the code chunks come after
#
signs. These comments will remind later readers—which might be your future self!—what the desired output of the code chunk is.If you created a new object, take a look at it! You can inspect the object by either clicking its name in your Environment tab or by typing its name into the console. Make sure it looks about how you expect. Do not type code to inspect the object in your Quarto file, as that is not code that needs to be saved!
If you created or updated a data frame, make sure your edits did what you hoped. Use the Environment or the
head()
function to investigate your changes.
Two heads are better than one!
It can be hard to spot bugs in code that you wrote.
Work with those around you - if something goes wrong, ask a friend or neighbor to take a peek at your code and see if any glaring errors pop out.
Explain your code out loud!
The best way to troubleshoot a sneaky bug is to explain, out loud, each step of your code, and what you hoped to accomplish.
Meet up with classmates to have everyone talk through their code.
If you are alone, try Rubber Duck Debugging!
Google is your friend!
The whole of the internet is at your disposal! Use it early, use it often.
Some tricks:
Copy-paste the exact error message into Google. Chances are, somebody else had a similar problem and got a similar message.
Include package names in your search terms. For example, “bar plot in ggplot” is a better search than “bar plot in R”.
Part One: Data Import and Cleaning
This section will clean today’s data so that you can use it more easily in Part Two.
Data Download & Package Loading
First, we declare our package dependencies and load the data.
The data loading function read_csv()
will give you an outpouring of helpful information about the dataset. If you do not see the word “error”, there is nothing to be concerned about.
Take a look at the variables in your downloaded data by running the following code. This code with the str
(structure) function reports the data type for each column in the dataset.
str(colleges, give.attr = FALSE)
spc_tbl_ [7,058 Ă— 27] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ...1 : num [1:7058] 1 2 3 4 5 6 7 8 9 10 ...
$ INSTNM : chr [1:7058] "Alabama A & M University" "University of Alabama at Birmingham" "Amridge University" "University of Alabama in Huntsville" ...
$ CITY : chr [1:7058] "Normal" "Birmingham" "Montgomery" "Huntsville" ...
$ STABBR : chr [1:7058] "AL" "AL" "AL" "AL" ...
$ ZIP : chr [1:7058] "35762" "35294-0110" "36117-3553" "35899" ...
$ CONTROL : num [1:7058] 1 1 2 1 1 1 1 1 1 1 ...
$ ADM_RATE : chr [1:7058] "0.9027" "0.9181" "NULL" "0.8123" ...
$ SAT_AVG : chr [1:7058] "929" "1195" "NULL" "1322" ...
$ TUITIONFEE_IN : chr [1:7058] "9857" "8328" "6900" "10280" ...
$ TUITIONFEE_OUT : chr [1:7058] "18236" "19032" "6900" "21480" ...
$ UGDS : chr [1:7058] "4824" "12866" "322" "6917" ...
$ REGION : num [1:7058] 5 5 5 5 5 5 5 5 5 5 ...
$ DEP_INC_PCT_H2 : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
$ MD_INC_WDRAW_2YR_TRANS_YR4_RT: chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
$ IND_COMP_4YR_TRANS_YR4_RT : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
$ SD_EARN_WNE_P10 : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
$ FEMALE_WDRAW_4YR_TRANS_YR6_RT: chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
$ LO_INC_COMP_2YR_TRANS_YR3_RT : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
$ NOLOAN_COMP_ORIG_YR4_RT : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
$ OPENADMP : chr [1:7058] "2" "2" "1" "2" ...
$ PELL_COMP_4YR_TRANS_YR3_RT : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
$ DEATH_YR2_RT : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
$ NOLOAN_UNKN_ORIG_YR2_RT : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
$ NOT1STGEN_WDRAW_ORIG_YR6_RT : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
$ HI_INC_YR8_N : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
$ CUML_DEBT_P90 : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
$ C100_L4 : chr [1:7058] "NULL" "NULL" "NULL" "NULL" ...
Data Cleaning
Now we will clean the data. Alas, each of the R chunks in this section will cause an error and / or do the desired task incorrectly. Even the chunks that run without error are not correct! You will need to find the mistake and correct it to complete the intended action.
Step 1: There are too many variables in this data set. We don’t need all of them. Narrow your data set down to only:
INSTNM
name of the institutionCITY
city,STABBR
state, andZIP
ZIP code of the institutionADM_RATE
admissions rateSAT_AVG
average SAT scoreUGDS
number of undergraduate studentsTUITIONFEE_IN
in- andTUITIONFEE_OUT
out-of-state tuitionCONTROL
Whether the school is public or privateREGION
region of the school.
<- colleges | >
colleges_clean select(INSTNM, CITY, STABBR, ZIP,
ADM_RATE, SAT_AVG, UGDS,
TUITIONFEE_IN, TUITIONFEE_OUT CONTROL, REGION)
Error in parse(text = input): <text>:1:30: unexpected '>'
1: colleges_clean <- colleges | >
^
Step 2: Remove the schools that are for-profit (category 3), keeping public (category 1) and private schools (category 2).
<- colleges_clean |>
colleges_clean filter(CONTROL == 1, CONTROL == 2)
Error: object 'colleges_clean' not found
Step 3: Adjust the appropriate variables to be numeric, using as.numeric()
.
<- colleges_clean |>
colleges_clean mutate(TUITIONFEE_IN = numeric(TUITIONFEE_IN),
TUITIONFEE_OUT = numeric(TUITIONFEE_OUT),
SAT_AVG = numeric(SAT_AVG),
UGDS = numeric(UGDS),
ADM_RATE = numeric(ADM_RATE))
Error: object 'colleges_clean' not found
Step 4: Adjust the appropriate variables to be factors, using as.factor()
.
We will talk more about special data types (including factors) in a few weeks.
<- colleges_clean |>
colleges_clean mutate(CONTROL = as.character(CONTROL),
REGION = as.character(REGION))
Error: object 'colleges_clean' not found
Step 5: Create a new variable called TUITION_DIFF
which contains the difference between out-of-state and in-state costs.
|>
colleges_clean = TUITIONFEE_OUT - TUITIONFEE_IN TUITION_DIFF
Error in TUITION_DIFF: The pipe operator requires a function call as RHS (<input>:2:5)
Step 6: Create a new variable called TOTAL_IN
which contains the total amount of money made from tuition per year. (Note, this is just an approximation by mutliplying the number of undergrads by the in-state tuition)
<- colleges_clean |>
colleges_clean select(TOTAL_IN = UGDS x TUITIONFEE_IN)
Error in parse(text = input): <text>:2:28: unexpected symbol
1: colleges_clean <- colleges_clean |>
2: select(TOTAL_IN = UGDS x
^
Step 7: Remove every row with missing data.
This is not always a great idea! Usually, even if some of the information is missing, we don’t want to throw out the entire row. This time, however, we’ll be lazy.
<- colleges_clean |>
colleges_clean drop.na()
Error in drop.na(colleges_clean): could not find function "drop.na"
Lastly, notice that each of these steps started with
<- colleges_clean |> ... colleges_clean
That is pretty redundant! Instead, we could perform all these tasks as one long “pipeline.”
Step 8: Combine your (fixed) code chunks into a single code chunk that carries out all of the steps necessary to clean the data and save it as colleges_clean
.
Think about coding efficiency – you should not have multiple calls to the same function!
# Code combining ALL of your previous steps into ONE pipeline
Part Two: Identify the Mystery College
Wow! Your best friend Margaret has been accepted to her top choice college! Unfortunately, Margaret is a very mysterious person, and she won’t tell you directly which college this is. You’ll have to use her clues to figure out which school is her dream school.
You should use the cleaned version of the data (colleges_clean
) and these clues to find her dream school!
Clues:
This college is located in Region 4.
This college’s admission rate is above the median rate for the region.
This college charges more for out-of-state tuition than in-state tuition.
More than 10,000 undergraduates attend this college.
The total amount of money made from tuition per year for this college is at least $200 million. (Use the
TOTAL_IN
variable you created)The average SAT score of this college is an odd number.
Out of the two remaining colleges, this college has the least expensive tuition. While you could just glance at the result to find this, practice using a dplyr verb to get down to one college.
# Code to solve for the mystery college using the 7 clues above.
Submit the college Margaret will attend to the Canvas Quiz.