library(tidyverse)
library(broom)
library(yardstick)
library(ggformula)
library(patchwork)
library(knitr)
library(kableExtra)
tips <- read_csv("../data/tip-data.csv") |>
  select(-`Tip Percentage`, -`W/Tip`)
AE 11: Model Comparison
Open RStudio and create a subfolder in your AE folder called “AE-11”.
Go to Canvas and locate your AE-11 assignment to get started.
Upload the ae-11.qmd and tip-data.csv files into the folder you just created. The .qmd and PDF responses are due in Canvas. You can check the due date on the Canvas assignment.
Packages + data
What factors are associated with the amount customers tip at a restaurant? To answer this question, we will use data collected in 2011 by a student at St. Olaf who worked at a local restaurant [1].
The variables we’ll focus on for this analysis are:
- Tip: amount of the tip
- Party: number of people in the party
- Meal: time of day (Lunch, Dinner, Late Night)
- Age: age category of the person paying the bill (Yadult, Middle, SenCit)
- Day: day of the week
- Payment: the type of payment used
- GiftCard: whether a gift card was used
- Comps: whether any food was comped
- Alcohol: whether any alcohol was ordered
- Bday: whether it was someone’s birthday
- Bill: the size of the bill
Analysis goal
The goals of this activity are to:
- Use ANOVA to determine whether our model is useful as a whole
- Begin thinking about \(R^2\) in a multivariate setting
Exercise 0
Complete the following to clean and then observe your data:
- Use drop_na to remove any rows where Party is missing.
tips <- tips |>
  _______ # drop missing values from Party
- Generate a bar chart of the variable Meal.
# insert code here
- Run the following code, then generate the same bar chart as above (you can copy and paste your code). What’s the difference between the two plots? What do you think fct_relevel does?
tips <- tips |>
  mutate(
    Meal = fct_relevel(Meal, "Lunch", "Dinner", "Late Night"),
    Age = fct_relevel(Age, "Yadult", "Middle", "SenCit")
  )
# Generate plot here
Exercise 1
Fit a model called tip_fit_1 to predict Tip from Party, Age, and Meal.
Exercise 2
Pipe the model you fit in the previous exercise into the anova function.
Exercise 3
Based on the output above, compute \(SSTotal\), \(SSError\), and \(SSModel\). You should only need to use addition and/or subtraction.
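As a rough illustration of how the rows of an anova table relate to the three sums of squares, here is a minimal sketch on the built-in mtcars data rather than the tips data. The model mpg ~ wt + hp and all object names are assumptions made only for this example.

```r
# Illustration only: built-in mtcars data, not the tips data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
tab <- anova(fit)  # one row per model term, plus a "Residuals" row

# SSError is the residual sum of squares.
ss_error <- tab["Residuals", "Sum Sq"]

# SSModel adds up the sequential sums of squares of all model terms.
ss_model <- sum(tab[rownames(tab) != "Residuals", "Sum Sq"])

# SSTotal computed directly from the response variable.
ss_total <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# The decomposition SSTotal = SSModel + SSError holds up to rounding.
all.equal(ss_model + ss_error, ss_total)
```

The same bookkeeping should carry over to the anova output for tip_fit_1, where each predictor contributes its own row.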
Exercise 4
Pipe your model into the glance function. Identify the F-statistic and p-value for an F-test of this model. Interpret the outcome of your test in the context of the problem. Be prepared to discuss the difference between this p-value and the p-values from the individual model coefficients.
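To see where the overall F-statistic comes from, the sketch below rebuilds it from sums of squares using base R. It again uses mtcars and a made-up model (mpg ~ wt + hp), so the numbers say nothing about the tips data.

```r
# Illustration only: built-in mtcars data, not the tips data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)            # number of observations
p <- length(coef(fit)) - 1   # number of slope coefficients (intercept excluded)

ss_error <- sum(residuals(fit)^2)
ss_total <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
ss_model <- ss_total - ss_error

# Overall F-statistic: model mean square over error mean square.
f_stat <- (ss_model / p) / (ss_error / (n - p - 1))

# Upper-tail p-value from the F distribution with (p, n - p - 1) df.
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

# summary() stores the same F-statistic; glance() from broom reports it
# in its `statistic` column, with the p-value in `p.value`.
all.equal(f_stat, unname(summary(fit)$fstatistic["value"]))
```

This single test asks whether the predictors are useful as a group, which is a different question from the one each coefficient's t-test answers.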
Exercise 5
Fit a model called tip_fit_2 to predict Tip from Party, Age, Meal, and Day.
Exercise 6
Of the two models you just fit, which is nested inside the other?
Exercise 7
Fit a third model called tip_fit_3 so that tip_fit_1 is nested inside tip_fit_3. Try to choose a model that you think will do a better job than tip_fit_1 of modeling the data. Feel free to include interaction or polynomial terms. Have your reporter write the model on the board.
Exercise 8
Change eval: FALSE to eval: TRUE. Alter the code below so that you’re comparing tip_fit_3 to tip_fit_1.
anova(tip_fit_1, tip_fit_2) |> # Enter reduced model first
  tidy() |>
  kable()
Write down the hypotheses, test statistic, and p-value of your test. Interpret the output in the context of the problem. Have your reporter write the p-value on the board.
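For reference, a nested F-test compares how much the error sum of squares drops when the extra terms are added. The null hypothesis is that the coefficients on all of the extra terms are zero. The sketch below computes the statistic by hand on mtcars; the two models and every object name are assumptions chosen only for illustration.

```r
# Illustration only: built-in mtcars data, not the tips data.
reduced <- lm(mpg ~ wt, data = mtcars)
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)  # contains every term in `reduced`

sse_reduced <- sum(residuals(reduced)^2)
sse_full    <- sum(residuals(full)^2)

df_reduced <- df.residual(reduced)
df_full    <- df.residual(full)

# Drop in SSE per extra coefficient, relative to the full model's mean square error.
f_stat <- ((sse_reduced - sse_full) / (df_reduced - df_full)) /
  (sse_full / df_full)
p_value <- pf(f_stat, df_reduced - df_full, df_full, lower.tail = FALSE)

# anova() with the reduced model first reports the same F-statistic.
cmp <- anova(reduced, full)
all.equal(f_stat, cmp$F[2])
```

A small p-value here would suggest that at least one of the added terms improves the fit beyond what the reduced model already explains.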
Exercise 9
Based on your test, choose tip_fit_1 or tip_fit_3. What is the \(R^2\) value for this model?
Exercise 10
Fit the full model. What is its \(R^2\)? Does this model have a higher or lower \(R^2\) than the previous model? Does this mean it is a better model? Be prepared to discuss.
Exercise 11
The following code converts the numerical variable Bill into a new categorical variable Bill_factor. Essentially, each distinct value of Bill is treated as its own category. Fit a model predicting Tip from Bill_factor. What is your \(R^2\)? Think about the implications of this and what it means for the usefulness of \(R^2\).
tips <- tips |>
  mutate(Bill_factor = factor(Bill))
Exercise 12
Apply glance to both models above to find the \(R^2\) and \(R^2_{adj}\) values.
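As a reminder of what \(R^2_{adj}\) measures, the sketch below recomputes it from \(R^2\), the sample size, and the number of slope coefficients. It uses mtcars with an arbitrary model, so the object names and the formula mpg ~ wt + hp are assumptions for illustration only.

```r
# Illustration only: built-in mtcars data, not the tips data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n  <- nrow(mtcars)            # number of observations
p  <- length(coef(fit)) - 1   # number of slope coefficients (intercept excluded)
r2 <- summary(fit)$r.squared

# Adjusted R^2 scales the unexplained share by (n - 1) / (n - p - 1),
# so every additional coefficient carries a penalty.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Matches the value stored by summary(), which glance() reports
# in its `adj.r.squared` column.
all.equal(adj_r2, summary(fit)$adj.r.squared)
```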
Exercise 13
- Which model would we choose based on \(R^2\)?
- Which model would we choose based on Adjusted \(R^2\)?
- Which statistic should we use to choose the final model - \(R^2\) or Adjusted \(R^2\)? Why?
Exercise 14
Reference the output from Exercise 9.
Which model would we choose based on AIC?
Which model would we choose based on BIC?
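The glance output includes AIC and BIC columns, and base R also provides AIC() and BIC() for comparing fitted models directly. Here is a minimal sketch on mtcars; the two models and their names are assumptions for this example, not the tip models.

```r
# Illustration only: built-in mtcars data, not the tips data.
small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# AIC() and BIC() accept several fitted models and return one row per model.
aic_tab <- AIC(small, large)
bic_tab <- BIC(small, large)

# Lower values indicate the preferred model under each criterion.
rownames(aic_tab)[which.min(aic_tab$AIC)]
rownames(bic_tab)[which.min(bic_tab$BIC)]
```

Both criteria start from the same log-likelihood. AIC adds a penalty of 2 per estimated parameter and BIC adds log(n) per parameter, so the two can disagree when the extra terms help only a little.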
Exercise 15
Navigate to the website for the olsrr package. Choose a selection technique other than best subset and apply it. Fit your resulting model.
To submit the AE
- Render the document to produce the PDF with all of your work from today’s class.
- Upload your QMD and PDF files to the Canvas assignment.
Footnotes
[1] Dahlquist, Samantha, and Jin Dong. 2011. “The Effects of Credit Cards on Tipping.” Project for Statistics 212-Statistics for the Sciences, St. Olaf College.