Model Evaluation

Prof. Eric Friedlander

Application exercise

📋 AE 10 - Model Evaluation: Open in Deepnote

Complete Exercise 0.

Computational set up

# load packages
library(tidyverse)   # for data wrangling and visualization
library(ggformula)   # for plotting using formulas
library(broom)       # for formatting model output
library(scales)      # for pretty axis labels
library(knitr)       # for pretty tables
library(kableExtra)  # also for pretty tables
library(patchwork)   # arrange plots

# Spotify Dataset
spotify <- read_csv("../data/spotify-popular.csv")

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Quick Data Cleaning

spotify <- spotify |> 
  mutate(duration_min = duration_ms / 60000)

  • What is this code doing?
  • Why might I be doing it?

The regression model, revisited

spotify_fit <- lm(danceability ~ duration_min, data = spotify)

tidy(spotify_fit, conf.int = TRUE, conf.level = 0.95) |>
  kable(digits = 3)

term          estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)      0.781      0.028     28.351    0.000     0.727      0.835
duration_min    -0.024      0.008     -3.151    0.002    -0.039     -0.009
  • There is strong statistical evidence of a linear relationship between the duration of a song and its danceability.

  • We are 95% confident that as the length of a song increases by one minute, the danceability will decrease by between 0.009 and 0.039 units, on average (illustrated in the sketch below).
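To see the slope in action, here is a minimal sketch (the 3- and 4-minute song lengths are made up for illustration): the two predictions differ by exactly the fitted slope.

# predict danceability for hypothetical 3- and 4-minute songs;
# the difference between the two predictions is the slope (about -0.024)
new_songs <- tibble(duration_min = c(3, 4))
predict(spotify_fit, newdata = new_songs)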

Model evaluation

Partitioning Variability

Let’s think about variation:

  • DATA = MODEL + ERROR
  • \(\substack{\text{Variation} \\ \text{in Y}} = \substack{\text{Variation explained} \\ \text{by model}} + \substack{\text{Variation not explained} \\ \text{by model}}\)

Partitioning Variability (ANOVA)

  • \(y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i-\hat{y}_i)\)
  • Square and sum: \(\sum(y_i-\bar{y})^2 = \sum(\hat{y}_i - \bar{y})^2 + \sum(y_i-\hat{y}_i)^2\) (the cross term sums to zero for least-squares fits)
  • \(\substack{\text{Sum of squares} \\ \text{Total}} = \substack{\text{Sum of squares} \\ \text{model}} + \substack{\text{Sum of squares} \\ \text{error}}\)
  • \(SSTotal = SSModel + SSE\)
  • \(SST = SSM + SSE\)
  • ANOVA: Analysis of Variance
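A quick numeric check of this decomposition using the model fit above (a minimal sketch; model.frame() recovers exactly the rows the fit used):

d     <- model.frame(spotify_fit)   # rows actually used in the fit
y     <- d$danceability
y_hat <- fitted(spotify_fit)

sst <- sum((y - mean(y))^2)       # total variation
ssm <- sum((y_hat - mean(y))^2)   # variation explained by the model
sse <- sum((y - y_hat)^2)         # leftover variation
c(SST = sst, SSM = ssm, SSE = sse, SSM_plus_SSE = ssm + sse)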

ANOVA in R

spotify_fit |>
  anova() |>
  tidy() |>
  kable() # Ignore this line when working in deepnote

term           df      sumsq     meansq  statistic    p.value
duration_min    1  0.1685294  0.1685294   9.928516  0.0017237
Residuals     506  8.5889829  0.0169743         NA         NA
  • More on this later in the semester
  • Complete Exercise 1.

Recall: Correlation Coefficient

  • The correlation coefficient, \(r\), is a number between -1 and +1 that measures how strong the linear relationship between two variables \(x\) and \(y\) is.

\[ r = \frac{\sum(x_i - \bar{x})(y_i-\bar{y})}{\sqrt{\sum(x_i-\bar{x})^2\sum(y_i-\bar{y})^2}} = \frac{\sum(x_i - \bar{x})(y_i-\bar{y})}{(n-1)s_xs_y} \]
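As a sketch, \(r\) computed from this formula should match R's built-in cor() (assuming no missing values in either column):

x <- spotify$duration_min
y <- spotify$danceability
r_formula <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
c(by_formula = r_formula, built_in = cor(x, y))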

Two statistics: \(R^2\)

  • R-squared, \(R^2\), the Coefficient of Determination: the percentage of variability in the outcome explained by the regression model (in the context of SLR, the predictor) \[ R^2 = Cor(y, \hat{y})^2 \]
    • Also called PRE (Percent Reduction in Error) because: \[ R^2 = \frac{SSModel}{SSTotal} \]
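A minimal sketch checking that the two characterizations agree, and match the value R reports, for spotify_fit:

y     <- model.frame(spotify_fit)$danceability
y_hat <- fitted(spotify_fit)
c(
  cor_sq   = cor(y, y_hat)^2,                                  # Cor(y, y-hat)^2
  pre      = sum((y_hat - mean(y))^2) / sum((y - mean(y))^2),  # SSModel / SSTotal
  built_in = summary(spotify_fit)$r.squared
)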

Two statistics: RMSE

  • Root mean square error, RMSE: A measure of the average error (average difference between observed and predicted values of the outcome) \[ RMSE = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} \]
    • Sometimes people care only about the numerator (SSE) or the version without the square root (MSE)
    • Sometimes the denominator is \(n-1\) instead of \(n\)
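A sketch computing RMSE by hand from the residuals, with both denominator conventions:

res <- residuals(spotify_fit)
sqrt(mean(res^2))                     # denominator n
sqrt(sum(res^2) / (length(res) - 1))  # denominator n - 1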

What indicates a good model fit? Higher or lower \(R^2\)? Higher or lower RMSE?

\(R^2\)

  • Ranges between 0 (terrible predictor) and 1 (perfect predictor)
  • Has no units
  • Calculate with rsq() from the yardstick package using the augmented data:
library(yardstick)
spotify_aug <- augment(spotify_fit)

rsq(spotify_aug, truth = danceability, estimate = .fitted) |> kable()

.metric  .estimator  .estimate
rsq      standard     0.019244

Interpreting \(R^2\)

🗳️ Discussion

The \(R^2\) of the model for danceability from duration_min is 1.9%. Which of the following is the correct interpretation of this value?

  1. duration_min correctly predicts 1.9% of danceability.
  2. 1.9% of the variability in danceability can be explained by duration_min.
  3. 1.9% of the variability in duration_min can be explained by danceability.
  4. 1.9% of the time danceability can be predicted by duration_min.

Complete Exercise 2.

Activity

In groups, at the board, design a simulation-based procedure for producing a p-value for the following hypothesis test.

  • \(H_0: R^2 = 0\)
  • \(H_A: R^2 \neq 0\)

Complete Exercise 3.
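One possible design (a sketch, not the only valid answer): shuffle duration_min to break any real association with danceability, refit, and record the null \(R^2\) values. Since \(R^2 \geq 0\), "at least as extreme as observed" means "at least as large."

set.seed(1234)   # hypothetical seed, for reproducibility
obs_rsq <- summary(spotify_fit)$r.squared

# refit on data with duration_min shuffled, 1000 times
null_rsq <- replicate(1000, {
  shuffled <- spotify |> mutate(duration_min = sample(duration_min))
  summary(lm(danceability ~ duration_min, data = shuffled))$r.squared
})

mean(null_rsq >= obs_rsq)   # simulation-based p-value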

RMSE

  • Ranges between 0 (perfect predictor) and infinity (terrible predictor)

  • Same units as the response variable

  • Interpretation (kind of): how much does my model miss, on average?

  • Calculate with rmse() from the yardstick package using the augmented data:

rmse(spotify_aug, truth = danceability, estimate = .fitted)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.130
  • Complete Exercise 4.

Using the word “Good”

  • There is no such thing as a “Good” \(R^2\) or, especially, RMSE without context
  • Whether your model is a “Good” model depends on many things:
    • What are you using your model for?
    • How good are other models?

Recap

  • Can decompose total variation (SST) into variation explained by the model (SSM) and leftover variation (SSE)
  • Two metrics for evaluating and comparing models:
    • \(R^2\): What proportion of the variation in the response variable is explained by the model?
    • \(RMSE\): By how much does my model miss, on average?