Intro to STAT 331/531 + Intro to R

Tuesday, April 1

Today we will…

  • Welcome to Stat 331/531: Statistical Computing in R
  • Intro to Me + You + the course
  • “Unplugged” Warm-up
  • Intro to R + RStudio
  • Organization
  • R Basics & Troubleshooting
  • PA 1: Find the Mistakes

Introductions

Me!

Hi, I’m Dr. C!

  • I am a transplant to the west coast.

  • My favorite things right now are cooking, knitting, and biking around SLO county.

  • I genuinely love R.

You!

I am looking forward to reading your introductions on Discord!

  • Please read the intros of your classmates so you can discover who you will be learning with this quarter.

Syllabus

Communication

  • ✉️ email
    • with your @calpoly.edu email address
    • with questions that relate to you as an individual
  • 💬 Discord
    • any and all questions!!
    • you will join and introduce yourself before Thursday
  • 🏢 office hours
    • come and discuss anything during my scheduled hours!
    • reach out to schedule other times

Course Materials

  • 🏫 Canvas
    • everything will be linked or posted on Canvas!
    • refer to Canvas for current deadilnes & announcements
  • 📕 “textbook”
    • online course notes
    • includes videos & tutorials
  • 🖥️ materials website
    • slides, labs, and practice activities are organized
    • this is for convenience, but everything will also be linked in Canvas

Assessments - Formative

  • Check-In’s (5%)
    • “check” that you are prepared for the week
    • based on reading from the textbook
  • Practice Activities (10%)
    • first attempt at new skills
    • get the hang of the week’s R skills
  • Lab Activities (30%)
    • “homework”

Work with other’s!

You should work with other’s on all assignments! The final work you submit for a Lab should be your own.

Assessments - Evaluative

  • Exams
    • Midterm Exam (15%)
    • Final Exam (25%)
  • Final project (15%)
    • group project
    • due end of week 10

Exam Conflicts

If you have a known conflict with an exam, please discuss it with me at least three weeks prior to the exam date

Assessments - General Criteria

  • correct outputs are great, but you will also need to demonstrate that you are intellectually engaging with the material in assignments
  • in addition to getting a correct output from your code, you will be assessed for efficiency and formatting
  • technical communication will also be an important element of assessments
  • I hope you will be proud of your work in this class!

Weekly Overview

Course Policies

Late Policy

  • 4 “deadline extensions”
  • email me before the deadline for a 24 hour extention

Attendance

Accessibility and Accomodations

  • let me know what I can do better
  • email me if you use the DRC

Academic Integrity

  • let’s be proactive to prevent situations where you are overwhelmed!
  • use Chat GPT as a tutor, not a substitute for your own work
  • cite outside resources!
  • please review the syllabus carefully

My expectations for you

  • ask lots of questions
  • take advantage of resources to help you learn
    • this includes working with others!
  • engage with what we are all doing together during class
  • work towards independent learning

My expectations for me

  • creating an inclusive and accessible classroom
  • providing resources needed to learn the material
  • providing prompt and clear feedback
  • doing my best to make class worth your time
  • clearly communicating

What to expect from this course

  • everyone comes with different knowledge and skills – figure out what helps you learn the best in this course
  • independent learning and learning from documentation is part of the course
  • coding involves unique challenges, frustrations, and satisfaction!

“Unplugged” Warmup

Statistical computing involves…

  • understanding data structures
    • i.e. “what you have to work with”
  • developing an algorithm
    • i.e. “what you want the computer to do”
  • knowing the syntax of a specific language
    • i.e. “how to tell the computer what to do so it will understand”
  • improving efficiency of your code

Not just syntax

All of these skills improve as the others improve and we will work on all of them in this course!

Paper Planes

  1. Work with the person next to you & introduce yourself!

  2. Write a set of instructions to fold a paper airplane

  1. Swap the instructions with the pair next to you
  2. Follow their instructions as literally as possible to fold a plane

Intro to R

What is R?

  • R is a programming language designed originally for statistical analyses.
  • R is open source

What’ going on with R?

R’s strengths are…

  • handling data with lots of different types of variables.

  • making nice and complex data visualizations.

  • having cutting-edge statistical methods available to users.

R’s weaknesses are…

  • performing non-analysis programming tasks, like website creation (python, ruby, …).

  • hyper-efficient numerical computation (matlab, C, …).

  • being a simple tool for all audiences (SPSS, STATA, JMP, minitab, …).

But wait!

Packages

The heart and soul of R are packages.

  • These are “extra” sets of code that add new functionality to R when installed.
  • “base” R refers to functions that are in the R software, and do not require a package.

Open-Source

Importantly, R is open-source.

  • There is no company that owns R, like there is for SAS or Matlab.
  • This means packages are created by users like you and me!
  • “Official” R packages live on the Comprehensive R Archive Network, or CRAN.
  • But anyone can write and share new code in “package form”

Open-Source

Being a good open-source citizen means…

  • sharing your code publicly when possible (later in this course, we’ll learn about GitHub!).
  • contributing to public projects and packages, as you are able.
  • creating your own packages, if you can.
  • using R for ethical and respectful projects.

Intro to

Let’s get into it!

  • If you have RStudio installed on your computer, open it now.

  • Otherwise, open the Posit Cloud Project

What is RStudio?

RStudio is an IDE (Integrated Developer Environment).

  • This means it is an application that makes it easier for you to interact with R.

RStudio is Your Friend!

  • You will always interact with R through RStudio

  • Helps with organization and some “point-and-click” options if desired

Organization & Directories

A lot of pieces go into a data analysis!

  • files with code
  • data files
  • documentation
  • images
  • reports
  • etc.

A frustrated looking little monster in front of a very disorganized cooking area, with smoke and fire coming from a pot surrounded by a mess of bowls, utensils, and scattered ingredients.

Artwork by Allison Horst

Organization is key!

An organized kitchen with sections labeled 'tools', 'report' and 'files', while a monster in a chef's hat stirs in a bowl labeled 'code.'

Artwork by Allison Horst

What is a directory?

  • A directory is just a fancy name for a folder.

  • Directories are your friends!

  • Best practice:

    • everything should have a place in a well-named directory
    • do not include spaces in directory names

Project Directory Examples

Class Directory Examples

Manage your Class Directory

Create a directory for this class!

  • Is it in a place you can easily find it?

    • NOT IN YOUR DOWNLOADS FOLDER ☠️
  • Does it have an informative name?

    • like “stat-331” or similar
  • Are the files inside it well-organized?

Warning

I cannot stress how important this is!!

R Basics

Open our “starter” notes to follow along

  • If on your computer: Download and save w1-notes.qmd on your computer and open it in RStudio.

  • If on Posit Cloud: Open “w1-notes.qmd”

How do I run code in R?

  • You can run any lines of code directly in the Console
    • But then your code isn’t saved!
    • This is best for exporatory and one-off tasks
  • We primarily write and run code in R scripts or notebooks in Source
    • These are documents where you can organize and save code!
    • More on scripts vs. notebooks tomorrow

Packages

To install a package use:

install.packages("tibble")
  • You should have to install a package only once.

To load a package use:

library(tibble)
  • You have to load a package each time you restart R.
  • You also should load packages at the beginning of any script.

Warning

Note that when you install packages you need to include quotation marks around the package name, but you don’t need to when loading a package!

Data Types

  • A value is a basic unit of stuff that a program works with.

  • Values have types:

  1. logical / boolean: FALSE/TRUE or 0/1 values.
  1. integer: whole numbers.
  1. double / float / numeric: decimal numbers.
  1. character / string - holds text, usually enclosed in quotes.

Variables

Variables are names that refer to values.

  • A variable is like a container that holds something - when you refer to the container, you get whatever is stored inside.

  • We assign values to variables using the syntax object_name <- value.

    • You can read this as “object name gets value” in your head.
message <- "So long and thanks for all the fish"
year <- 2025
the_answer <- 42
earth_demolished <- FALSE

Naming Conventions

  • some_people_use_snake_case
  • somePeopleUseCamelCase
  • some.people.use.periods
  • A few people mix conventions with variables_thatLookLike.this and they are almost universally hated.

Tip

Just pick one and stick with it!

Data Structures

Homogeneous: every element has the same data type.

  • Vector: a one-dimensional column of homogeneous data.

  • Matrix: the next step after a vector - it’s a set of homogenous data arranged in a two-dimensional, rectangular format.

Heterogeneous: the elements can be of different types.

  • List: a one-dimensional column of heterogeneous data.

  • Dataframe: a two-dimensional set of heterogeneous data arranged in a rectangular format.

Indexing

We use square brackets ([]) to access elements within data structures.

  • In R, we start indexing from 1.

Vector:

vec[4]    # 4th element
vec[1:3]  # first 3 elements

Matrix:

mat[2,6]  # element in row 2, col 6
mat[,3]   # all elements in col 3

List:

li[[5]]    # 5th element

Dataframe:

df[1,2]     # element in row 1, col 2
df[17,]     # all elements in row 17
df$calName  # all elements in the col named "colName"
df[["calName"]] # all elements in the col named "colName"

Logic

We can combine logical statements using and, or, and not.

  • (X AND Y) requires that both X and Y are true.

  • (X OR Y) requires that one of X or Y is true.

  • (NOT X) is true if X is false, and false if X is true.

x <- c(TRUE, FALSE, TRUE, FALSE)
y <- c(TRUE, TRUE, FALSE, FALSE)

x & y   # AND
[1]  TRUE FALSE FALSE FALSE
x | y   # OR
[1]  TRUE  TRUE  TRUE FALSE
!x & y  # NOT X AND Y
[1] FALSE  TRUE FALSE FALSE

Vectorization

  • R is designed to do vector and matrix math nicely

  • Many operations in R are vectorized.

    • These functions operate on vectors of values rather than a single value.
    • i.e. the function applies to every element of a vector individually.

e.g.:

x <- seq(from = -2, to = 2)
x
[1] -2 -1  0  1  2

Say we want to add 1 to every element of x

Vectorization

Loop:

for(i in 1:length(x)){
  x[i] <- x[i] + 1
}
x
[1] -1  0  1  2  3

Vectorized:

x <- x + 1
x
[1] -1  0  1  2  3
  • See how leveraging vectorization in R is great?

Loops be gone!

Now forget about loops after this slide! We rarely need them in R and will avoid using them in this class

Troubleshooting Errors!

Syntax Errors

Did you leave off a parenthesis?

seq(from = 1, to = 10, by = 1

seq(from = 1, to = 10, by = 1
Error in parse(text = input): <text>:2:0: unexpected end of input
1: seq(from = 1, to = 10, by = 1
   ^
seq(from = 1, to = 10, by = 1)
 [1]  1  2  3  4  5  6  7  8  9 10

Did you leave off a comma?

seq(from = 1, to = 10 by = 1)

seq(from = 1, to = 10 by = 1)
Error in parse(text = input): <text>:1:23: unexpected symbol
1: seq(from = 1, to = 10 by
                          ^
seq(from = 1, to = 10, by = 1)
 [1]  1  2  3  4  5  6  7  8  9 10

Did you make a typo? Are you using the right names?

sequence(from = 1, to = 10, by = 1)

sequence(from = 1, to = 10, by = 1)
Error in sequence.default(from = 1, to = 10, by = 1): argument "nvec" is missing, with no default
seq(from = 1, to = 10, by = 1)
 [1]  1  2  3  4  5  6  7  8  9 10

Object Type Errors

Are you using the right input that the function expects?

sqrt(‘1’)

sqrt('1')
Error in sqrt("1"): non-numeric argument to mathematical function
sqrt(1)
[1] 1

Are you expecting the right output of the function?

my_obj <- seq(from = 1, to = 10, by = 1)

my_obj(5)
Error in my_obj(5): could not find function "my_obj"
my_obj[5]
[1] 5

Errors + Warnings + Messages

Messages

Just because you see scary red text, this does not mean something went wrong! This is just R communicating with you.

  • For example, you will often see:
library(lme4)
Loading required package: Matrix

Warnings

Often, R will give you a warning.

  • This means that your code did run…

  • …but you probably want to make sure it succeeded.

Does this look right?

my_vec <- c("a", "b", "c")

my_new_vec <- as.numeric(my_vec)
Warning: NAs introduced by coercion
my_new_vec
[1] NA NA NA

Errors

If the word Error appears in your message from R, then you have a problem.

  • This means your code could not run!
my_vec <- c("a", "b", "c")

my_new_vec <- my_vec + 1
Error in my_vec + 1: non-numeric argument to binary operator

R what are you trying to tell me??

R says…

Error in ggplot() : could not find function “ggplot”

It probably means…

You haven’t installed/loaded the package that includes the ggplot() function OR you mispelled the function name

You should:

  1. check if the package is installed
  2. load the package (library())

R says…

Error: Object some_obj not found.

It probably means…

You haven’t run the code to create some_obj OR you have a typo in the name!

some_ojb <- 1:10

mean(some_obj)
Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'mean': object 'some_obj' not found

R says…

Error: Object of type ‘closure’ is not subsettable.

It probably means…

Oops, you tried to use square brackets on a function

mean[1, 2]
Error in mean[1, 2]: object of type 'closure' is not subsettable

R says…

Error: Non-numeric argument to binary operator.

It probably means…

You tried to do math on data that isn’t numeric.

"a" + 2
Error in "a" + 2: non-numeric argument to binary operator

What if none of these solved my error?

  • Look at the help file for the function!

  • Break what you are doing down into smaller pieces and look at whether or not each step gives you what you expect

  • When all else fails, Google your error message.

    • Include the function you are using.

Try it…

What’s wrong here?

matrix(c("a", "b", "c", "d"), num_row = 2)
Error in matrix(c("a", "b", "c", "d"), num_row = 2): unused argument (num_row = 2)

Tip

Look up the help file for the function matrix by running ?matrix in the Console.

PA 1: Find the Mistakes

Part One:

This file has many mistakes in the code. Some are errors that will prevent the file from knitting; some are mistakes that do NOT result in an error.

Fix all the problems in the code chunks.

Part Two:

Follow the instructions in the file to uncover a secret message.

Submit the name of the poem as the answer to the Canvas Quiz question.

To do…

  • Read Chapter 1: Introduction
  • Check-ins 1.1 - 1.4
    • Due Thursday (4/3) before class
  • PA 1: Find the Mistakes
    • Due Friday (4/4) at 11:59 pm