SLR: Conditions + Model Evaluation

Prof. Eric Friedlander

Application exercise

Complete Exercise 0.

Computational setup

# load packages
library(tidyverse)   # for data wrangling and visualization
library(ggformula)   # for plotting using formulas
library(broom)       # for formatting model output
library(scales)      # for pretty axis labels
library(knitr)       # for pretty tables
library(kableExtra)  # also for pretty tables
library(patchwork)   # arrange plots

# Spotify Dataset
spotify <- read_csv("../data/spotify-popular.csv")

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Quick Data Cleaning

spotify <- spotify |> 
  mutate(duration_min = duration_ms / 60000)
  • What is this code doing?
  • Why might I be doing it?

The regression model, revisited

spotify_fit <- lm(danceability ~ duration_min, data = spotify)

tidy(spotify_fit, conf.int = TRUE, conf.level = 0.95) |>
   kable(digits = 3)
term          estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)      0.781      0.028     28.351    0.000     0.727      0.835
duration_min    -0.024      0.008     -3.151    0.002    -0.039     -0.009
  • There is strong statistical evidence of a linear relationship between the duration of a song and its danceability.

  • We are 95% confident that as the length of a song increases by one minute, the danceability will decrease by between 0.009 and 0.039 units.
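
If you just want the interval, base R’s confint() reproduces the conf.low/conf.high columns above; a quick check:

confint(spotify_fit, level = 0.95)  # same bounds as conf.low / conf.high above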

Mathematical representation, visualized

\[ Y|X \sim N(\beta_0 + \beta_1 X, \sigma_\epsilon^2) \]

Image source: Introduction to the Practice of Statistics (5th ed)
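
To make this picture concrete, here is a minimal simulation sketch of the model above; the parameter values are hypothetical, chosen only for illustration.

# Simulate Y | X ~ N(beta0 + beta1 * X, sigma^2) with hypothetical parameters
set.seed(42)
beta0 <- 0.78; beta1 <- -0.024; sigma <- 0.13          # illustrative values only
x <- runif(200, min = 1, max = 8)                      # predictor values
y <- rnorm(200, mean = beta0 + beta1 * x, sd = sigma)  # Normal around the line

gf_point(y ~ x, data = tibble(x = x, y = y)) |>
  gf_lm()                                              # overlay the fitted line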

Model conditions

  1. Linearity: There is a linear relationship between the outcome and predictor variables
  2. Constant variance: The variability of the errors is equal for all values of the predictor variable
  3. Normality: The errors follow a normal distribution
  4. Independence: The errors are independent from each other

WARNING

  • Many of these assumptions are about the population
  • We want to determine whether they are met using our data
  • In real life, these conditions are almost always violated in one way or another
  • Questions you should ask yourself:
    • Are my conditions close enough to being satisfied that I can trust the results of my inference?
    • Do I have reason to believe that my conditions are GROSSLY violated?
    • Based on what I see, how trustworthy do I think the results of my inference are?

ENGAGE: SOAP BOX

Statistics and numbers are often used to make arguments seem more “rigorous” or infallible. I’m sure you’ve heard the phrase “the numbers are the numbers” or “you can’t argue with the numbers”. More often than not, this is BULLSHIT. Most data analyses involve making decisions which are subjective. The interpretability of any form of statistical inference is heavily influenced by whether the assumptions and conditions of that inference are met, which they almost never are. It is up to the practitioner to determine whether those conditions are met and what impact any violations have on the results of those analyses. In my work, I rarely encounter practitioners who even know what the conditions are, let alone understand why they are important. FURTHERMORE!!! The quality of your analysis is only as good as the quality of your data. Remember: a crap study design will yield crap data, which will yield a crap analysis. Statistical analyses yield one important form of evidence, which should be combined with other forms of evidence when making an argument.

Linearity

  • If the linear model, \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1X\), adequately describes the relationship between \(X\) and \(Y\), then the residuals should reflect random (chance) error

  • To assess this, we can look at a plot of the residuals (\(e_i\)’s) vs. the fitted values (\(\hat{y}_i\)’s) or the predictor values (\(x_i\)’s)

  • There should be no distinguishable pattern in the residuals plot, i.e. the residuals should be randomly scattered

  • A non-random pattern (e.g. a parabola) suggests that the linear model does not adequately describe the relationship between \(X\) and \(Y\)

Linearity

✅ The residuals vs. fitted values plot should show a random scatter of residuals (no distinguishable pattern or structure)

The augment function

augment is from the broom package:

spotify_aug <- augment(spotify_fit)

head(spotify_aug) |> kable()
danceability  duration_min    .fitted      .resid       .hat     .sigma    .cooksd  .std.resid
       0.733      2.464000  0.7212126   0.0117874  0.0057228  0.1304132  0.0000237   0.0907334
       0.630      3.258650  0.7019569  -0.0719569  0.0021749  0.1303749  0.0003332  -0.5529036
       0.877      3.864133  0.6872849   0.1897151  0.0024253  0.1301401  0.0025838   1.4579193
       0.831      3.674783  0.6918732   0.1391268  0.0020725  0.1302669  0.0011866   1.0689702
       0.668      2.650183  0.7167011  -0.0487011  0.0044968  0.1303962  0.0003170  -0.3746465
       0.626      3.377017  0.6990886  -0.0730886  0.0020230  0.1303736  0.0003196  -0.5615572

Residuals vs. fitted values (code)

spotify_aug <- augment(spotify_fit)

gf_point(.resid ~ .fitted, data = spotify_aug) |>
  gf_hline(yintercept = 0, linetype = "dashed") |>
  gf_labs(
    x = "Fitted value", y = "Residual",
    title = "Residuals vs. fitted values"
  )

Non-linear relationships
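
To see what this failure mode looks like, here is a minimal simulated sketch; the quadratic data-generating process is hypothetical, chosen only for illustration.

# Hypothetical: a quadratic relationship fit with a straight line
set.seed(1)
x <- runif(200, 0, 4)
y <- x^2 + rnorm(200, sd = 0.8)   # true relationship is a parabola
curved_aug <- augment(lm(y ~ x))

gf_point(.resid ~ .fitted, data = curved_aug) |>
  gf_hline(yintercept = 0, linetype = "dashed") |>
  gf_labs(x = "Fitted value", y = "Residual",
          title = "A parabola in the residuals: linearity violated")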

Violations of Linearity

  • Impact: inference relies on estimates of \(\sigma_\epsilon\) computed from residuals:
    • Residuals will be larger in certain places so estimates will be inaccurate
    • Therefore, inference (i.e. CIs and p-values) will be inaccurate
    • Most importantly… your predictions will be wrong most of the time
  • Remedy: transform your data (to come)

Complete Exercises 1-3

Constant variance

  • If the spread of the distribution of \(Y\) is equal for all values of \(X\) then the spread of the residuals should be approximately equal for each value of \(X\)

  • To assess this, we can look at a plot of the residuals vs. the fitted values

  • The vertical spread of the residuals should be approximately equal as you move from left to right

  • CAREFUL: An uneven distribution of the \(X\)s can make it seem as if there is non-constant variance

Constant variance

✅ The vertical spread of the residuals is relatively constant across the plot

Non-constant variance: Fan-Pattern

  • Constant variance is frequently violated when the size of the errors is proportional to the response variable (see the simulated sketch below)
  • Whose wealth fluctuates more per day… Dr. Friedlander’s or Elon Musk’s?
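
A simulated sketch of the fan pattern; the data-generating values are hypothetical.

# Hypothetical data where the error sd grows with the predictor/response
set.seed(2)
x <- runif(200, 1, 10)
y <- 2 + 0.5 * x + rnorm(200, sd = 0.2 * x)   # sd proportional to x
fan_aug <- augment(lm(y ~ x))

gf_point(.resid ~ .fitted, data = fan_aug) |>
  gf_hline(yintercept = 0, linetype = "dashed") |>
  gf_labs(x = "Fitted value", y = "Residual",
          title = "Residuals fan out: non-constant variance")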

Violations of Constant Variance

  • Impact: inference relies on estimates of \(\sigma_\epsilon\) computed from residuals:
    • Residuals will be larger in certain places so estimates will be inaccurate
    • Therefore, inference (i.e. CIs and p-values) will be inaccurate
  • Remedy: transform your data (to come)

Complete Exercise 4

Normality

  • The linear model assumes that the distribution of \(Y\) is Normal for every value of \(X\)

  • This is impossible to check in practice, so we will look at the overall distribution of the residuals to assess if the normality assumption is satisfied

  • A histogram of the residuals should look approximately normal, symmetric, without any huge outliers

  • The points in a normal QQ-plot should fall along the diagonal line

  • Most inferential methods for regression are robust to some departures from normality, so we can proceed with inference if the sample size is sufficiently large, roughly \(n > 30\) depending on how non-normal your residuals look

    • Notable exception: prediction intervals!

Check normality using a histogram

Check normality using a QQ-plot

Code
gf_qq(~.resid, data = spotify_aug) |>
  gf_qqline() |>
  gf_labs(x = "Theoretical quantile",
       y = "Observed quantile",
       title = "Normal QQ-plot of residuals")
  • \(x\)-axis: quantile we would expect from a true normal distribution

  • \(y\)-axis: quantile we observe in the data

  • Bell-shaped does not necessarily equal normal… a QQ-plot can detect distributions with heavier (i.e. more spread out) tails than a normal distribution (see the simulated comparison below)
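
To see this, compare simulated normal draws with heavy-tailed \(t\) draws (both hypothetical); the tails of the \(t\) sample peel away from the line.

# Normal vs heavy-tailed draws; patchwork (loaded above) arranges the plots
set.seed(3)
sims <- tibble(normal = rnorm(500),
               heavy  = rt(500, df = 3))   # t with 3 df has heavier tails

p1 <- gf_qq(~normal, data = sims) |> gf_qqline() |> gf_labs(title = "Normal draws")
p2 <- gf_qq(~heavy,  data = sims) |> gf_qqline() |> gf_labs(title = "t (df = 3) draws")
p1 + p2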

Check normality: histogram and QQ-plot

Code
gf_histogram(~.resid, data = spotify_aug,
             bins = 7, color = "white") |>
  gf_labs(
    x = "Residual",
    y = "Count",
    title = "Histogram of residuals"
  )

Code
gf_qq(~.resid, data = spotify_aug) |>
  gf_qqline() |>
  gf_labs(x = "Theoretical quantile",
       y = "Observed quantile",
       title = "Normal QQ-plot of residuals")

  • Assess whether residuals lie along the diagonal line of the Quantile-quantile plot (QQ-plot).

  • If so, the residuals are normally distributed.

  • Note: QQ-plots are quite sensitive, so it doesn’t take much of a departure from the line to conclude non-normality

Normality

❌ The residuals do not appear to follow a normal distribution, because the points do not lie on the diagonal line, so normality is not satisfied.

✅ The sample size is \(n = 508 > 30\), which is large enough to relax this condition and proceed with inference (mostly).

Violations of Normality

  • Impact: depends on what you want to do and how large your sample size is
    • Your prediction intervals will be wrong… they will be symmetric when they shouldn’t be…
    • If you have a large sample size, it’s not a big deal for anything else
  • Remedy: depends on what you want to do
    • If the sample size is large enough and you don’t care about prediction intervals… do nothing
    • Otherwise, transform your data… (hard for small sample sizes)

Complete Exercises 5 and 6.

Independence

  • We can often check the independence assumption based on the context of the data and how the observations were collected

  • Three common violations of the independence assumption:

    • Temporal Correlation: If the data were collected over time, plot the residuals in time order to see if there is a pattern (serial correlation); see the sketch after this list

    • Cluster Effect: If there are subgroups represented in the data that are not accounted for in the model, you can color the points in the residual plots by group to see if the model systematically over- or under-predicts for a particular subgroup

    • Spatial Correlation: If observations that are close to one another are more correlated than ones which are far apart, then independence is violated
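
For the temporal check, a sketch of plotting residuals in row order; this assumes the rows of spotify are in collection order, which we don’t actually know.

# Residuals in row order; a drift or wave-like pattern suggests serial correlation
spotify_aug |>
  mutate(order = row_number()) |>
  gf_point(.resid ~ order) |>
  gf_labs(x = "Observation order", y = "Residual",
          title = "Residuals in observation order")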

Independence

Complete Exercise 7

❌ Based on the information we have, it’s unlikely the data are independent.

Violations of Independence

  • Impact: depends on how it’s violated
    • In some cases you’ll underestimate p-values (too many false positives) and make CIs which are too narrow
    • In other cases you’ll do the opposite
  • Remedy… depends on the source and type of dependence
    • Add a variable accounting for the dependence to the model
    • Otherwise, beyond the scope of this class
      • Time-Series Analysis
      • Mixed-Effects Models
      • Spatial Statistics

Model evaluation

Partitioning Variability

Let’s think about variation:

  • DATA = MODEL + ERROR
  • \(\substack{\text{Variation} \\ \text{in Y}} = \substack{\text{Variation explained} \\ \text{by model}} + \substack{\text{Variation not explained} \\ \text{by model}}\)

Partitioning Variability (ANOVA)

  • \(y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i-\hat{y}_i)\)
  • Square and sum (the cross term sums to zero for a least-squares fit): \(\sum_{i=1}^n(y_i-\bar{y})^2 = \sum_{i=1}^n(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n(y_i-\hat{y}_i)^2\)
  • \(\substack{\text{Sum of squares} \\ \text{Total}} = \substack{\text{Sum of squares} \\ \text{model}} + \substack{\text{Sum of squares} \\ \text{error}}\)
  • \(SSTotal = SSModel + SSE\)
  • \(SST = SSM + SSE\) (verified numerically below)
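
A quick numerical check of the decomposition using the augmented data from earlier:

# Verify SST = SSM + SSE from the augmented model output
y    <- spotify_aug$danceability
yhat <- spotify_aug$.fitted

SST <- sum((y - mean(y))^2)      # total variation
SSM <- sum((yhat - mean(y))^2)   # variation explained by the model
SSE <- sum((y - yhat)^2)         # leftover variation

c(SST = SST, SSM_plus_SSE = SSM + SSE)   # the two entries should match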

ANOVA in R

spotify_fit |>
  anova() |>
  tidy() |>
  kable()
term           df      sumsq     meansq  statistic    p.value
duration_min    1  0.1685294  0.1685294   9.928516  0.0017237
Residuals     506  8.5889829  0.0169743         NA         NA
  • More on this later in the semester
  • Complete Exercise 8.

Recall: Correlation Coefficient

  • The correlation coefficient, \(r\), is a number between -1 and +1 that measures the strength and direction of the linear relationship between two variables \(x\) and \(y\).

\[ r = \frac{\sum(x_i - \bar{x})(y_i-\bar{y})}{\sqrt{\sum(x_i-\bar{x})^2\sum(y_i-\bar{y})^2}} = \frac{\sum(x_i - \bar{x})(y_i-\bar{y})}{(n-1)s_xs_y} \]
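A quick check of this formula against R’s built-in cor():

# Compute r from the formula and compare to cor()
x <- spotify$duration_min
y <- spotify$danceability

sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
cor(x, y)   # should match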

Two statistics: \(R^2\)

  • R-squared, \(R^2\), the Coefficient of Determination: the percentage of variability in the outcome explained by the regression model (in the context of SLR, the predictor) \[ R^2 = Cor(y, \hat{y})^2 \]
    • Also called PRE (Percent Reduction in Error) because: \[ R^2 = \frac{SSModel}{SSTotal} \]
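
Both characterizations give the same number; a sketch, reusing the augmented data from earlier:

# R^2 two equivalent ways (compare with rsq() on the next slide)
y    <- spotify_aug$danceability
yhat <- spotify_aug$.fitted

cor(y, yhat)^2                                    # squared correlation
sum((yhat - mean(y))^2) / sum((y - mean(y))^2)    # SSModel / SSTotal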

Two statistics: RMSE

  • Root mean square error, RMSE: a measure of the average error (average difference between observed and predicted values of the outcome) \[ RMSE = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} \]
    • Sometimes people just care about the numerator (SSE) or the version without the square root (MSE)
    • Sometimes the denominator may be \(n-1\) instead of \(n\)
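
Computed by hand from the residuals (compare with rmse() below):

# RMSE by hand: root of the mean squared residual
sqrt(mean(spotify_aug$.resid^2))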

What indicates a good model fit? Higher or lower \(R^2\)? Higher or lower RMSE?

\(R^2\)

  • Ranges between 0 (terrible predictor) and 1 (perfect predictor)
  • Has no units
  • Calculate with rsq() from the yardstick package using the augmented data:
library(yardstick)
spotify_aug <- augment(spotify_fit)

rsq(spotify_aug, truth = danceability, estimate = .fitted) |> kable()
.metric  .estimator  .estimate
rsq      standard     0.019244

Interpreting \(R^2\)

🗳️ Discussion

The \(R^2\) of the model for danceability from duration_min is 1.9%. Which of the following is the correct interpretation of this value?

  1. duration_min correctly predicts 1.9% of danceability.
  2. 1.9% of the variability in danceability can be explained by duration_min.
  3. 1.9% of the variability in duration_min can be explained by danceability.
  4. 1.9% of the time danceability can be predicted by duration_min.

Complete Exercise 9.

Activity

In groups, at the board, design a simulation-based procedure for producing a p-value for the following hypothesis test.

  • \(H_0: R^2 = 0\)
  • \(H_A: R^2 \neq 0\)

RMSE

  • Ranges between 0 (perfect predictor) and infinity (terrible predictor)

  • Same units as the response variable

  • Interpretation (kind of): how much does my model miss by, on average.

  • Calculate with rmse() from the yardstick package using the augmented data:

rmse(spotify_aug, truth = danceability, estimate = .fitted)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.130
  • Complete Exercise 10.

Using the word “Good”

  • There is no such thing as a “Good” \(R^2\) or, especially, RMSE without context
  • Whether your model is a “Good” model depends on many things:
    • What are you using your model for?
    • How good are other models?

Recap

Used residual plots to check conditions for SLR:

  • Linearity (residuals vs fitted vals)
  • Constant variance (residuals vs fitted vals)
  • Normality (histogram/QQ-plot of residuals)
  • Independence (knowledge of data collection)

Note: Predictions are still valid as long as linearity is met, but p-values and CIs are not valid unless the other three conditions also hold

  • Can decompose total variation (SST) into variation explained by the model (SSM) and leftover variation (SSE)
  • Two metrics for evaluating and comparing models:
    • \(R^2\): What proportion of the variation in the response variable is explained by the model?
    • \(RMSE\): How much does my model miss by, on average?