Navigate to Teams.
In a private chat with me, answer the following questions:
In the class discussion forum, please recommend at least one song for the class playlist… Feel free to suggest as many songs as you like.
What are response and explanatory variables?
“In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or ‘predictors’). More specifically, regression analysis helps one understand how the typical value of the dependent variable (or ‘criterion variable’) changes when any one of the independent variables is varied, while the other independent variables are held fixed.”
Source: Wikipedia (previous definition)
Note: I don’t really like the terms “independent” and “dependent” variables
New Yorkers Will Pay $56 A Month To Trim A Minute Off Their Commute
How FiveThirtyEight’s 2020 Presidential Forecast Works — And What’s Different Because Of COVID-19
Effect of Forensic Evidence on Criminal Justice Case Processing
Why it’s so freaking hard to make a good COVID-19 model (from March 2020)
Q - What background is assumed for the course?
A - Introductory statistics or previous experience with mathematics at a level that would allow you to learn intro stats concepts relatively easily
Q - Will we be doing computing?
A - Yes. We will use the computing language R for analysis and Quarto for writing up results.
Q - Am I expected to have experience using any of these tools?
A - No. I do not expect you to have any exposure to R and certainly not Quarto.
Q - Will we learn the mathematical theory of regression?
A - Yes and No. The course is primarily focused on application; however, we will discuss some of the mathematics of simple linear regression.
Q - How much time should I be spending on this class?
A - This is a 3-credit class taught over 15 days, and it meets for 2.5 hours per day. That means you should be spending approximately 9 hours per day working on this course (i.e., 6.5 hours outside of class).
By the end of the semester, you will be able to…
What is a quantitative and what is a categorical variable?
Chapter | Response | Predictor/Explanatory |
---|---|---|
1-2 | Quantitative | Single Quantitative |
3-4 | Quantitative | Multiple Quantitative |
5 | Quantitative | Single Categorical |
6-8 | Quantitative | Multiple Categorical |
9 | Categorical | Single Quant/Cat |
10 | Categorical | Multiple Quant/Cat |
11 | Both | Both |
All analyses using R, a statistical programming language
Write reproducible reports in Quarto
Access RStudio through College of Idaho Posit Workbench
Use your College of Idaho email and password
Prepare: Introduce new content and prepare for lectures by completing the readings (and sometimes watching videos)
Participate: Attend and actively participate in lectures, office hours, team meetings
Practice: Practice applying statistical concepts and computing with application exercises during lecture, graded for completion
Perform: Put together what you’ve learned to analyze real-world data
Homework assignments (individual)
Two oral exams
Final group projects
Category | Percentage |
---|---|
Homework | 25% |
Final Project | 25% |
Exam 01 | 20% |
Exam 02 | 20% |
Application Exercises | 10% |
Note: You must receive at least a 60% on your two exams to pass the course.
See the syllabus for details on how the final letter grade will be calculated.
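As a rough illustration of how the category weights above combine, here is a minimal sketch in R. The scores are hypothetical, and the sketch covers only the weighted percentage; it does not apply the 60% exam requirement or the letter-grade rules, which are in the syllabus.

```r
# Category weights taken from the grading table (they sum to 1)
weights <- c(
  homework = 0.25,
  final_project = 0.25,
  exam_01 = 0.20,
  exam_02 = 0.20,
  application_exercises = 0.10
)

# Hypothetical scores (in percent) for one student; not real data
scores <- c(
  homework = 90,
  final_project = 85,
  exam_01 = 70,
  exam_02 = 80,
  application_exercises = 100
)

# Weighted average: multiply each score by its weight, then add them up
course_percentage <- sum(weights * scores[names(weights)])
course_percentage
```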
AEs are due the day after the class they are assigned. No late work is accepted for application exercises, since these are designed as in-class activities to help you prepare for homework.
If an application exercise or project must be missed due to a school-sponsored event, you must let me know at least a week ahead of time so that we can schedule a time for you to make up the work before you leave. If you must miss an exam or a project presentation due to illness, you must let me know before class that day so that we can schedule a time for you to make it up. Failure to adhere to this policy will result in a 35% penalty on the corresponding assignment.
The College of Idaho maintains that academic honesty and integrity are essential values in the educational process. Operating under an Honor Code philosophy, the College expects conduct rooted in honesty, integrity, and understanding, allowing members of a diverse student body to live together and interact and learn from one another in ways that protect both personal freedom and community standards. Violations of academic honesty are addressed primarily by the instructor and may be referred to the Student Judicial Board.
By participating in this course, you are agreeing that all your work and conduct will be in accordance with the College of Idaho Honor Code.
I have policies!
Let’s read about them in the Academic honesty section of the syllabus
✅ AI tools for code: You may use AI tools for coding examples on assignments or to fix bugs in your code. However, if you do so, you must explicitly cite where you obtained the code. AI should serve as a learning aid, not a replacement for thinking.
❌ No AI tools for narrative: Unless instructed otherwise, AI is not permitted for writing narrative on assignments.
Important
In general, you may use AI as a resource as you complete assignments but not to answer the exercises for you. You are ultimately responsible for the work you turn in; it should reflect your understanding of the course content. Any code or content in your homework that was not covered in class or cited is eligible to appear on your exams.
Complete all the preparation work (readings and videos) before class.
Ask questions.
Start your work (homework and projects) early!
Don’t procrastinate and don’t let a day pass by with lingering questions.
Stay up-to-date on announcements on Canvas and sent via email.
This class is a lot of work
Steep learning curve in the beginning… stick with it! I promise you can do it!
More writing than you probably expected… it is not enough for Dr. F to know what you mean to say… you must say that! Dr. F always asks: “If this student said this in a job interview, would they get hired?”
In statistics, there is rarely one RIGHT answer… it’s all about extracting information from data to make arguments
Showing up late to class
Using Generative AI to do your thinking for you
If you find an error on the website, slides, homework, activities, etc. (e.g., a typo or a broken link) and you are the first person to point it out, you will receive a bonus point toward your HW grade. However, your HW grade may not exceed 100%.
If you message me about this during class, you will not receive your extra credit.
Raise your hand or post on Teams
Source: R for Data Science with additions from The Art of Statistics: How to Learn from Data.
Source: R for Data Science
What does it mean for an analysis to be reproducible?
Near term goals:
✔️ Can the tables and figures be exactly reproduced from the code and data?
✔️ Does the code actually do what you think it does?
✔️ In addition to what was done, is it clear why it was done?
Long term goals:
✔️ Can the code be used for other data?
✔️ Can you extend the code to do other things?
Results produced are more reliable and trustworthy (Ostblom and Timbers 2022)
Facilitates more effective collaboration (Ostblom and Timbers 2022)
Contributing to science, which builds and organizes knowledge in terms of testable hypotheses (Alexander 2023)
Possible to identify and correct errors or biases in the analysis process (Alexander 2023)
Reproducibility error | Consequence | Source(s) |
---|---|---|
Limitations in Excel data formats | Loss of 16,000 COVID case records in the UK | (Kelion 2020) |
Automatic formatting in Excel | Important genes disregarded in scientific studies | (Ziemann, Eren, and El-Osta 2016) |
Deletion of a cell caused rows to shift | Mix-up of which patient group received the treatment | (Wallensteen et al. 2018) |
Using binary instead of explanatory labels | Mix-up of the intervention with the control group | (Aboumatar and Wise 2019) |
Using the same notation for missing data and zero values | Paper retraction | (Whitehouse et al. 2021) |
Incorrectly copying data in a spreadsheet | Delay in the opening of a hospital | (Picken 2020) |
Source: Ostblom and Timbers (2022)
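To make one row of the table concrete (using the same notation for missing data and zero values), here is a minimal sketch in R. The counts are hypothetical and are not taken from any of the cited studies; the point is only that coding a missing value as 0 silently changes a summary, while `NA` forces an explicit decision.

```r
# Hypothetical daily counts; the third day was never recorded
counts_zero_coded <- c(12, 8, 0, 15)   # missing day wrongly entered as 0
counts_na_coded   <- c(12, 8, NA, 15)  # missing day marked as NA

# The zero-coded version treats the gap as a true zero and drags the mean down
mean_zero_coded <- mean(counts_zero_coded)

# With NA, R returns NA until we say how to handle the missing value
mean_na_default <- mean(counts_na_coded)

# Averaging over the recorded days only
mean_na_removed <- mean(counts_na_coded, na.rm = TRUE)

c(zero_coded = mean_zero_coded, na_default = mean_na_default, na_removed = mean_na_removed)
```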
Scriptability \(\rightarrow\) R
Literate programming (code, narrative, output in one place) \(\rightarrow\) Quarto
Version control \(\rightarrow\) Git / GitHub (Beyond the scope of this course)
R is a statistical programming language
RStudio is a convenient interface for R (an integrated development environment, IDE)
Fully reproducible reports – the analysis is run from the beginning each time you render
Code goes in chunks and narrative goes outside of chunks
Visual editor to make document editing experience similar to a word processor (Google docs, Word, Pages, etc.)
Every application exercise and assignment is written in a Quarto document
You’ll have a template Quarto document to start with
The amount of scaffolding in the template will decrease over the semester
Any time we are working on AEs, I will randomly assign you to groups of two/three. Each person will have a role:
Complete through Exercise 15.
Starbucks often displays the total calories in their food items but not the other nutritional information.
Carbohydrates are a body’s main fuel source. The Dietary Guidelines for Americans recommend that carbohydrates make up 45% to 65% of total daily calories.
Our goal is to understand the relationship between the amount of carbohydrates and calories in Starbucks food items. We’d also like to assess if the relationship differs based on the type of food item (bakery, salad, sandwich, etc.).
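library(openintro)  # provides the starbucks data set
library(dplyr)      # provides mutate(), if_else(), and glimpse()

# Add a two-level factor: "bakery" vs. every other food type
starbucks <- starbucks |>
  mutate(bakery = factor(if_else(type == "bakery", "bakery", "non-bakery")))
glimpse(starbucks)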
Rows: 77
Columns: 8
$ item <chr> "8-Grain Roll", "Apple Bran Muffin", "Apple Fritter", "Banana…
$ calories <int> 350, 350, 420, 490, 130, 370, 460, 370, 310, 420, 380, 320, 3…
$ fat <dbl> 8, 9, 20, 19, 6, 14, 22, 14, 18, 25, 17, 12, 17, 21, 5, 18, 1…
$ carb <int> 67, 64, 59, 75, 17, 47, 61, 55, 32, 39, 51, 53, 34, 57, 52, 7…
$ fiber <int> 5, 7, 0, 4, 0, 5, 2, 0, 0, 0, 2, 3, 2, 2, 3, 3, 2, 3, 0, 2, 0…
$ protein <int> 10, 6, 5, 7, 0, 6, 7, 6, 5, 7, 4, 6, 5, 5, 12, 7, 8, 6, 0, 10…
$ type <fct> bakery, bakery, bakery, bakery, bakery, bakery, bakery, baker…
$ bakery <fct> bakery, bakery, bakery, bakery, bakery, bakery, bakery, baker…
carb: Total carbohydrates (in grams)
calories: Total calories
bakery: bakery (bakery food item) or non-bakery (other food type)
carb is the response variable
calories, bakery are the explanatory variables
\[\text{carb} = f(\text{calories}, \text{bakery}) + \epsilon\]
\[\text{carb} = \beta_0 + \beta_1 ~\text{calories} + \epsilon\]
\[\text{carb} = \beta_0 + \beta_1 ~\text{calories} + \beta_2 ~\text{bakery} + \epsilon\]
\[{\small \text{carb} = \beta_0 + \beta_1 ~\text{calories} + \beta_2 ~\text{bakery} + \beta_3 ~ \text{calories} \times \text{bakery} + \epsilon}\]
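The three linear forms above map onto R model formulas roughly as follows. This is a sketch rather than the course's official code: it assumes the openintro and dplyr packages are installed, and the object names (`starbucks_bakery`, `fit_simple`, `fit_additive`, `fit_interaction`) are made up for illustration.

```r
library(openintro)  # starbucks data set
library(dplyr)      # mutate(), if_else()

# Recreate the two-level bakery indicator used in the slides
starbucks_bakery <- starbucks |>
  mutate(bakery = factor(if_else(type == "bakery", "bakery", "non-bakery")))

# carb = b0 + b1 * calories + error
fit_simple <- lm(carb ~ calories, data = starbucks_bakery)

# carb = b0 + b1 * calories + b2 * bakery + error
fit_additive <- lm(carb ~ calories + bakery, data = starbucks_bakery)

# carb = b0 + b1 * calories + b2 * bakery + b3 * calories x bakery + error
# (calories * bakery expands to both main effects plus their interaction)
fit_interaction <- lm(carb ~ calories * bakery, data = starbucks_bakery)

# One estimated coefficient per beta in each equation
names(coef(fit_simple))
names(coef(fit_additive))
names(coef(fit_interaction))
```

Because `bakery` is a factor, R represents it with an indicator column for the non-baseline level, so the interaction model has four coefficients, matching the four betas in the last equation.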
Size measurements, clutch observations, and blood isotope ratios for adult foraging Adélie, Chinstrap, and Gentoo penguins observed on islands in the Palmer Archipelago near Palmer Station, Antarctica.
What does the relationship between Length and Depth look like?
What does the relationship between Length and Depth look like now?
Simpson’s Paradox is when there is a clear relationship between two variables, but when you introduce a third variable, that relationship disappears or reverses.
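A reversal of this kind can be reproduced with a tiny data set. The sketch below uses hypothetical, hand-built numbers (not the penguin measurements): two groups that each trend upward but sit at opposite corners of the plot, so the pooled trend points the other way.

```r
# Hypothetical data: within each group, y rises by 1 for each unit of x
toy <- data.frame(
  group = rep(c("A", "B"), each = 5),
  x = c(1:5, 11:15),
  y = c(11:15, 1:5)
)

# Ignoring the group: one line through all ten points
pooled_slope <- coef(lm(y ~ x, data = toy))[["x"]]

# Accounting for the group: a separate line within each group
within_slopes <- sapply(
  split(toy, toy$group),
  function(d) coef(lm(y ~ x, data = d))[["x"]]
)

pooled_slope   # negative: y appears to fall as x rises
within_slopes  # positive in both groups: y rises with x
```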
Statistical model (also known as data-generating model)
\[{\small \text{carb} = \beta_0 + \beta_1 ~\text{calories} + \beta_2 ~\text{bakery} + \beta_3 ~ \text{calories} \times \text{bakery} + \epsilon}\]
Models the process for generating values of the response in the population (function + error)
Regression equation
Estimate of the function using the sample data
\[{\small \hat{\text{carb}} = \hat{\beta}_0 + \hat{\beta}_1 ~\text{calories} + \hat{\beta}_2 ~\text{bakery} + \hat{\beta}_3 ~ \text{calories} \times \text{bakery}}\]
Prediction: Expected value of the response variable for given values of the predictor variables
Inference: Conclusion about the relationship between the response and predictor variables
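One way to see the difference between the two objectives is to ask both kinds of question of the same fitted model. The sketch below assumes the openintro and dplyr packages are installed; the 300-calorie bakery item and the object names are hypothetical, chosen only for illustration.

```r
library(openintro)  # starbucks data set
library(dplyr)      # mutate(), if_else()

starbucks_bakery <- starbucks |>
  mutate(bakery = factor(if_else(type == "bakery", "bakery", "non-bakery")))

carb_fit <- lm(carb ~ calories * bakery, data = starbucks_bakery)

# Prediction: expected carbohydrates for a hypothetical 300-calorie bakery item
new_item <- data.frame(
  calories = 300,
  bakery = factor("bakery", levels = levels(starbucks_bakery$bakery))
)
predicted_carb <- predict(carb_fit, newdata = new_item)

# Inference: interval estimates for the coefficients, used to draw conclusions
# about how carb relates to calories and bakery
coef_intervals <- confint(carb_fit, level = 0.95)

predicted_carb
coef_intervals
```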
What is an example of a prediction question that can be answered using the model of carb vs. calories and bakery?
What is an example of an inference question that can be answered using the model of carb vs. calories and bakery?
We can use exploratory data analysis to describe the relationship between two variables
We make an assumption about the relationship between variables when doing linear regression
The two main objectives for fitting a linear regression model are (1) prediction and (2) inference