Model Evaluation

Prof. Eric Friedlander

Application exercise

📋 AE 10 - Model Evaluation: Open in Deepnote

Complete Exercise 0.

Computational set up

# load packages
library(tidyverse)   # for data wrangling and visualization
library(ggformula)   # for plotting using formulas
library(broom)       # for formatting model output
library(scales)      # for pretty axis labels
library(knitr)       # for pretty tables
library(kableExtra)  # also for pretty tables
library(patchwork)   # arrange plots

# Spotify Dataset
spotify <- read_csv("../data/spotify-popular.csv")

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Quick Data Cleaning

spotify <- spotify |> 
  mutate(duration_min = duration_ms / 60000)

  • What is this code doing?
  • Why might I be doing it?

The regression model, revisited

spotify_fit <- lm(danceability ~ duration_min, data = spotify)

tidy(spotify_fit, conf.int = TRUE, conf.level = 0.95) |>
  kable(digits = 3)

term          estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)      0.781      0.028     28.351    0.000     0.727      0.835
duration_min    -0.024      0.008     -3.151    0.002    -0.039     -0.009
  • There is strong statistical evidence of a linear relationship between the duration of a song and its danceability.

  • We are 95% confident that as the length of a song increases by one minute, the danceability will decrease by between 0.009 and 0.039 units, on average (illustrated in the sketch below).
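To see the slope in action, here is a minimal sketch (the 3- and 4-minute song lengths are made up for illustration): the two predictions differ by exactly the fitted slope.

# predict danceability for hypothetical 3- and 4-minute songs;
# the difference between the two predictions is the slope (about -0.024)
new_songs <- tibble(duration_min = c(3, 4))
predict(spotify_fit, newdata = new_songs)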

Model evaluation

Partitioning Variability

Let’s think about variation:

  • DATA = MODEL + ERROR
  • \(\substack{\text{Variation} \\ \text{in Y}} = \substack{\text{Variation explained} \\ \text{by model}} + \substack{\text{Variation not explained} \\ \text{by model}}\)

Partitioning Variability (ANOVA)

  • \(y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i-\hat{y}_i)\)
  • Square and sum: \(\sum(y_i-\bar{y})^2 = \sum(\hat{y}_i - \bar{y})^2 + \sum(y_i-\hat{y}_i)^2\) (the cross term sums to zero for least-squares fits)
  • \(\substack{\text{Sum of squares} \\ \text{Total}} = \substack{\text{Sum of squares} \\ \text{model}} + \substack{\text{Sum of squares} \\ \text{error}}\)
  • \(SSTotal = SSModel + SSE\)
  • \(SST = SSM + SSE\)
  • ANOVA: Analysis of Variance
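A quick numeric check of this decomposition using the model fit above (a minimal sketch; model.frame() recovers exactly the rows the fit used):

d     <- model.frame(spotify_fit)   # rows actually used in the fit
y     <- d$danceability
y_hat <- fitted(spotify_fit)

sst <- sum((y - mean(y))^2)       # total variation
ssm <- sum((y_hat - mean(y))^2)   # variation explained by the model
sse <- sum((y - y_hat)^2)         # leftover variation
c(SST = sst, SSM = ssm, SSE = sse, SSM_plus_SSE = ssm + sse)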

ANOVA in R

spotify_fit |>
  anova() |>
  tidy() |>
  kable() # Ignore this line when working in deepnote

term           df      sumsq     meansq  statistic    p.value
duration_min    1  0.1685294  0.1685294   9.928516  0.0017237
Residuals     506  8.5889829  0.0169743         NA         NA
  • More on this later in the semester
  • Complete Exercise 1.

Recall: Correlation Coefficient

  • The correlation coefficient, \(r\), is a number between -1 and +1 that measures how strong the linear relationship between two variables \(x\) and \(y\) is.

\[ r = \frac{\sum(x_i - \bar{x})(y_i-\bar{y})}{\sqrt{\sum(x_i-\bar{x})^2\sum(y_i-\bar{y})^2}} = \frac{\sum(x_i - \bar{x})(y_i-\bar{y})}{(n-1)s_xs_y} \]
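As a sketch, \(r\) computed from this formula should match R's built-in cor() (assuming no missing values in either column):

x <- spotify$duration_min
y <- spotify$danceability
r_formula <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
c(by_formula = r_formula, built_in = cor(x, y))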

Two statistics: \(R^2\)

  • R-squared, \(R^2\), the Coefficient of Determination: the percentage of variability in the outcome explained by the regression model (in the context of SLR, the predictor) \[ R^2 = Cor(y, \hat{y})^2 \]
    • Also called PRE (Percent Reduction in Error) because: \[ R^2 = \frac{SSModel}{SSTotal} \]
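A minimal sketch checking that the two characterizations agree, and match the value R reports, for spotify_fit:

y     <- model.frame(spotify_fit)$danceability
y_hat <- fitted(spotify_fit)
c(
  cor_sq   = cor(y, y_hat)^2,                                  # Cor(y, y-hat)^2
  pre      = sum((y_hat - mean(y))^2) / sum((y - mean(y))^2),  # SSModel / SSTotal
  built_in = summary(spotify_fit)$r.squared
)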

Two statistics: RMSE

  • Root mean square error, RMSE: A measure of the average error (average difference between observed and predicted values of the outcome) \[ RMSE = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} \]
    • Sometimes people care only about the numerator (SSE) or the version without the square root (MSE)
    • Sometimes the denominator is \(n-1\) instead of \(n\)
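A sketch computing RMSE by hand from the residuals, with both denominator conventions:

res <- residuals(spotify_fit)
sqrt(mean(res^2))                     # denominator n
sqrt(sum(res^2) / (length(res) - 1))  # denominator n - 1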

What indicates a good model fit? Higher or lower \(R^2\)? Higher or lower RMSE?

\(R^2\)

  • Ranges between 0 (terrible predictor) and 1 (perfect predictor)
  • Has no units
  • Calculate with rsq() from the yardstick package using the augmented data:
library(yardstick)
spotify_aug <- augment(spotify_fit)

rsq(spotify_aug, truth = danceability, estimate = .fitted) |> kable()

.metric  .estimator  .estimate
rsq      standard     0.019244

Interpreting \(R^2\)

🗳️ Discussion

The \(R^2\) of the model for danceability from duration_min is 1.9%. Which of the following is the correct interpretation of this value?

  1. duration_min correctly predicts 1.9% of danceability.
  2. 1.9% of the variability in danceability can be explained by duration_min.
  3. 1.9% of the variability in duration_min can be explained by danceability.
  4. 1.9% of the time danceability can be predicted by duration_min.

Complete Exercise 2.

Activity

In groups, at the board, design a simulation-based procedure for producing a p-value for the following hypothesis test.

  • \(H_0: R^2 = 0\)
  • \(H_A: R^2 \neq 0\)

Complete Exercise 3.
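One possible design (a sketch, not the only valid answer): shuffle duration_min to break any real association with danceability, refit, and record the null \(R^2\) values. Since \(R^2 \geq 0\), "at least as extreme as observed" means "at least as large."

set.seed(1234)   # hypothetical seed, for reproducibility
obs_rsq <- summary(spotify_fit)$r.squared

# refit on data with duration_min shuffled, 1000 times
null_rsq <- replicate(1000, {
  shuffled <- spotify |> mutate(duration_min = sample(duration_min))
  summary(lm(danceability ~ duration_min, data = shuffled))$r.squared
})

mean(null_rsq >= obs_rsq)   # simulation-based p-value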

RMSE

  • Ranges between 0 (perfect predictor) and infinity (terrible predictor)

  • Same units as the response variable

  • Interpretation (kind of): how much does my model miss, on average?

  • Calculate with rmse() from the yardstick package using the augmented data:

rmse(spotify_aug, truth = danceability, estimate = .fitted)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.130
  • Complete Exercise 4.

Using the word “Good”

  • There is no such thing as a “Good” \(R^2\) or, especially, RMSE without context
  • Whether your model is a “Good” model depends on many things:
    • What are you using your model for?
    • How good are other models?

Recap

  • Can decompose total variation (SST) into variation explained by the model (SSM) and leftover variation (SSE)
  • Two metrics for evaluating and comparing models:
    • \(R^2\): What proportion of the variation in the response variable is explained by the model?
    • \(RMSE\): By how much does my model miss, on average?