Homework 06: Candy Competition

Inference for Multiple Linear Regression

Introduction

In today’s homework you will analyze data about candy that was collected from an online experiment conducted at FiveThirtyEight.

Learning goals

By the end of the homework you will be able to

  • Fit a linear model with multiple predictors and an interaction term
  • Fit a linear model with categorical predictors
  • Conduct inference on multiple linear models

Getting started

  • Go to RStudio and login with your College of Idaho Email and Password.

  • Make a subfolder in your hw directory to store this homework.

  • Log into Canvas, navigate to Homework 6 and upload the hw-06.qmd file into the folder your just made.

Packages

The following packages are used in the lab.

library(tidyverse)
library(broom)
library(ggformula)
library(fivethirtyeight)
library(knitr)
library(yardstick)

# add other packages as needed

Data: Candy

The data from this lab comes from the the article FiveThirtyEight The Ultimate Halloween Candy Power Ranking by Walt Hickey. To collect data, Hickey and collaborators at FiveThirtyEight set up an experiment people could vote on a series of randomly generated candy matchups (e.g. Reese’s vs. Skittles). Click here to check out some of the match ups.

The data set contains the characteristics and win percentage from 85 candies in the experiment. The variables are

Variable Description
chocolate Does it contain chocolate?
fruity Is it fruit flavored?
caramel Is there caramel in the candy?
peanutyalmondy Does it contain peanuts, peanut butter or almonds?
nougat Does it contain nougat?
crispedricewafer Does it contain crisped rice, wafers, or a cookie component?
hard Is it a hard candy?
bar Is it a candy bar?
pluribus Is it one of many candies in a bag or box?
sugarpercent The percentile of sugar it falls under within the data set. Values 0 - 1.
pricepercent The unit price percentile compared to the rest of the set. Values 0 - 1.
winpercent The overall win percentage according to 269,000 matchups. Values 0 - 100.

Use the code below to get a glimpse of the candy_rankings data frame in the fivethirtyeight R package.

glimpse(candy_rankings)
Rows: 85
Columns: 13
$ competitorname   <chr> "100 Grand", "3 Musketeers", "One dime", "One quarter…
$ chocolate        <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ fruity           <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE…
$ caramel          <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ peanutyalmondy   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, …
$ nougat           <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ crispedricewafer <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ hard             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ bar              <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ pluribus         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE…
$ sugarpercent     <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.604, 0.31…
$ pricepercent     <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.767, 0.51…
$ winpercent       <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.…

Exercises

The goal of this analysis is to use multiple linear regression to understand the factors that make a good candy.

Exercise 1

Notice that the values of pricepercent and sugarpercent are proportions. User mutate to change the scale so that they are percentages.

Exercise 2

  • Our response variable in this homework will be winpercentage. Choose two additional variables, one quantitative and one categorical. Generate a SINGLE plot that visualizes all three variables. Hint: remember that you can tie any aesthetic in your plot (e.g. color) to a variable be writing aesthetic = ~variable_name. For example color = ~variable1 will color the points based on variable1.
  • Write two observations from your plot.

Exercise 3

Fit a linear model including both variables you chose above.

Exercise 4

Interpret the following in the context of the data:

  • Intercept

  • Coefficient of your quantitative variable

  • Coefficient(s) of your categorical variables

Exercise 5

Choose one of the coefficients from your model and write out all the steps in the hypothesis test that corresponds to it’s p-value as in this slide.

Exercise 6

Interpret the other p-values in your model in the context of the data. You do not need to write out the full hypothesis testing framework as you did in Exercise 5.

Exercise 7

Generate 95% confidence intervals for all of the slopes in your model and interpret them in the context of the data.

Exercise 8

Now fit a model using the same explanatory variables, but include an interaction term. Interpret the coefficient of your interaction term in the context of the data.

Exercise 9

Interpret the p-value associated with the interaction term.

Exercise 10

Based on the results from exercise 9, would you choose the model with or without the interaction term?

Grading

Total points available: 15 points.

Component Points
Ex 1 1
Ex 2 2
Ex 3 1
Ex 4 1
Ex 5 1
Ex 6 1
Ex 7 2
Ex 8 2
Ex 9 1
Ex 10 1
Grammar & Writing 11
Workflow & formatting 12

Footnotes

  1. The “Grammar & Writing” grade is decided based on your grammar and writing. This is typically decided by choosing one of the questions and assessing the writing.↩︎

  2. The “Workflow & formatting” grade is to assess the reproducible workflow and document format. This includes having a neatly organized document with readable code and your name and the date in the YAML.↩︎