library(tidyverse)
library(broom)
library(ggformula)
library(fivethirtyeight)
library(knitr)
library(yardstick)
# add other packages as needed
Homework 06: Candy Competition
Inference for Multiple Linear Regression
Introduction
In today’s homework you will analyze data about candy that was collected from an online experiment conducted at FiveThirtyEight.
Learning goals
By the end of the homework you will be able to
- Fit a linear model with multiple predictors and an interaction term
- Fit a linear model with categorical predictors
- Conduct inference on multiple linear models
Getting started
Packages
The following packages are used in the lab.
Data: Candy
The data from this lab comes from the the article FiveThirtyEight The Ultimate Halloween Candy Power Ranking by Walt Hickey. To collect data, Hickey and collaborators at FiveThirtyEight set up an experiment people could vote on a series of randomly generated candy matchups (e.g. Reese’s vs. Skittles). Click here to check out some of the match ups.
The data set contains the characteristics and win percentage from 85 candies in the experiment. The variables are
Variable | Description |
---|---|
chocolate |
Does it contain chocolate? |
fruity |
Is it fruit flavored? |
caramel |
Is there caramel in the candy? |
peanutyalmondy |
Does it contain peanuts, peanut butter or almonds? |
nougat |
Does it contain nougat? |
crispedricewafer |
Does it contain crisped rice, wafers, or a cookie component? |
hard |
Is it a hard candy? |
bar |
Is it a candy bar? |
pluribus |
Is it one of many candies in a bag or box? |
sugarpercent |
The percentile of sugar it falls under within the data set. Values 0 - 1. |
pricepercent |
The unit price percentile compared to the rest of the set. Values 0 - 1. |
winpercent |
The overall win percentage according to 269,000 matchups. Values 0 - 100. |
Use the code below to get a glimpse of the candy_rankings
data frame in the fivethirtyeight R package.
glimpse(candy_rankings)
Rows: 85
Columns: 13
$ competitorname <chr> "100 Grand", "3 Musketeers", "One dime", "One quarter…
$ chocolate <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ fruity <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE…
$ caramel <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ peanutyalmondy <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, …
$ nougat <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ crispedricewafer <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ hard <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ bar <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ pluribus <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE…
$ sugarpercent <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.604, 0.31…
$ pricepercent <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.767, 0.51…
$ winpercent <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.…
Exercises
The goal of this analysis is to use multiple linear regression to understand the factors that make a good candy.
Exercise 1
Notice that the values of pricepercent
and sugarpercent
are proportions. User mutate
to change the scale so that they are percentages.
Exercise 2
- Our response variable in this homework will be
winpercentage
. Choose two additional variables, one quantitative and one categorical. Generate a SINGLE plot that visualizes all three variables. Hint: remember that you can tie any aesthetic in your plot (e.g. color) to a variable be writingaesthetic = ~variable_name
. For examplecolor = ~variable1
will color the points based onvariable1
. - Write two observations from your plot.
Exercise 3
Fit a linear model including both variables you chose above.
Exercise 4
Interpret the following in the context of the data:
Intercept
Coefficient of your quantitative variable
Coefficient(s) of your categorical variables
Exercise 5
Choose one of the coefficients from your model and write out all the steps in the hypothesis test that corresponds to it’s p-value as in this slide.
Exercise 6
Interpret the other p-values in your model in the context of the data. You do not need to write out the full hypothesis testing framework as you did in Exercise 5.
Exercise 7
Generate 95% confidence intervals for all of the slopes in your model and interpret them in the context of the data.
Exercise 8
Now fit a model using the same explanatory variables, but include an interaction term. Interpret the coefficient of your interaction term in the context of the data.
Exercise 9
Interpret the p-value associated with the interaction term.
Exercise 10
Based on the results from exercise 9, would you choose the model with or without the interaction term?
Grading
Total points available: 15 points.
Component | Points |
---|---|
Ex 1 | 1 |
Ex 2 | 2 |
Ex 3 | 1 |
Ex 4 | 1 |
Ex 5 | 1 |
Ex 6 | 1 |
Ex 7 | 2 |
Ex 8 | 2 |
Ex 9 | 1 |
Ex 10 | 1 |
Grammar & Writing | 11 |
Workflow & formatting | 12 |
Footnotes
The “Grammar & Writing” grade is decided based on your grammar and writing. This is typically decided by choosing one of the questions and assessing the writing.↩︎
The “Workflow & formatting” grade is to assess the reproducible workflow and document format. This includes having a neatly organized document with readable code and your name and the date in the YAML.↩︎