```r
# load packages
library(tidyverse)  # for data wrangling and visualization
library(ggformula)  # for plotting using formulas
library(broom)      # for formatting model output
library(scales)     # for pretty axis labels
library(knitr)      # for pretty tables
library(kableExtra) # also for pretty tables
library(patchwork)  # arrange plots

# Spotify Dataset
spotify <- read_csv("../data/spotify-popular.csv")

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))
```
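The code below uses a fitted model object `spotify_fit` that is defined on an earlier slide. Based on the ANOVA table and \(R^2\) output later in the deck, it is presumably the simple linear regression of danceability on song duration, something like:

```r
# assumed model (defined on an earlier slide): danceability vs. duration
spotify_fit <- lm(danceability ~ duration_min, data = spotify)
```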
Image source: Introduction to the Practice of Statistics (5th ed)
Model conditions
Linearity: There is a linear relationship between the outcome and predictor variables
Constant variance: The variability of the errors is equal for all values of the predictor variable
Normality: The errors follow a normal distribution
Independence: The errors are independent from each other
WARNING
Many of these assumptions are about the population
We want to determine whether they are met using our data
In real life, these conditions are almost always violated in one way or another
Questions you should ask yourself:
Are my conditions close enough to being satisfied that I can trust the results of my inference?
Do I have reason to believe that my conditions are GROSSLY violated?
Based on what I see, how trustworthy do I think the results of my inference are?
ENGAGE: SOAP BOX
Statistics and numbers are often used to make arguments seem more “rigorous” or infallible. I’m sure you’ve heard the phrase “the numbers are the numbers” or “you can’t argue with the numbers”. More often than not, this is BULLSHIT. Most data analyses involve making decisions which are subjective. The interpretability of any form of statistical inference is heavily influenced by whether the assumptions and conditions of that inference are met, which they almost never are. It is up to the practitioner to determine whether those conditions are met and what impact they have on the results of those analyses. In my work, I rarely encounter practitioners who even know what the conditions are, let alone understand why they are important. FURTHERMORE!!! The quality of your analysis is only as good as the quality of your data. Remember: a crap study design will yield crap data, which will yield a crappy analysis. Statistical analyses yield one important form of evidence, which should be combined with other forms of evidence when making an argument.
Linearity
If the linear model, \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1X\), adequately describes the relationship between \(X\) and \(Y\), then the residuals should reflect random (chance) error
To assess this, we can look at a plot of the residuals (\(e_i\)’s) vs. the fitted values (\(\hat{y}_i\)’s) or the predictors (\(x_i\)’s)
There should be no distinguishable pattern in the residuals plot, i.e. the residuals should be randomly scattered
A non-random pattern (e.g. a parabola) suggests that the linear model does not adequately describe the relationship between \(X\) and \(Y\)
Linearity
✅ The residuals vs. fitted values plot should show a random scatter of residuals (no distinguishable pattern or structure)
```r
spotify_aug <- augment(spotify_fit)

gf_point(.resid ~ .fitted, data = spotify_aug) |>
  gf_hline(yintercept = 0, linetype = "dashed") |>
  gf_labs(
    x = "Fitted value", y = "Residual",
    title = "Residuals vs. fitted values"
  )
```
Non-linear relationships
Violations of Linearity
Impact: inference relies on estimates of \(\sigma_\epsilon\) computed from residuals:
Residuals will be larger in certain places so estimates will be inaccurate
Therefore, inference (i.e. CIs and p-values) will be inaccurate
Most importantly… your predictions will be wrong most of the time
Remedy: transform your data (to come); a simulated example of what a violation looks like is sketched below
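Here is a minimal sketch using simulated (hypothetical) data, not the Spotify set: the true relationship is quadratic, so a straight-line fit leaves a parabolic pattern in the residuals.

```r
# hypothetical data: the true relationship between x and y is quadratic
set.seed(1)
sim <- tibble(
  x = runif(100, 0, 10),
  y = (x - 5)^2 + rnorm(100, sd = 2)
)

sim_aug <- augment(lm(y ~ x, data = sim)) # misspecified straight-line fit

# the residuals trace out a parabola instead of random scatter
gf_point(.resid ~ .fitted, data = sim_aug) |>
  gf_hline(yintercept = 0, linetype = "dashed") |>
  gf_labs(x = "Fitted value", y = "Residual")
```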
Complete Exercises 1-3
Constant variance
If the spread of the distribution of \(Y\) is equal for all values of \(X\), then the spread of the residuals should be approximately equal for each value of \(X\)
To assess this, we can look at a plot of the residuals vs. the fitted values
The vertical spread of the residuals should be approximately equal as you move from left to right
CAREFUL: An uneven distribution of \(X\) values can make it seem as if there is non-constant variance
Constant variance
✅ The vertical spread of the residuals is relatively constant across the plot
Non-constant variance: fan pattern
Constant variance is frequently violated when the size of the errors is proportional to the response variable
Whose wealth fluctuates more per day… Dr. Friedlander’s or Elon Musk’s?
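A minimal sketch with simulated (hypothetical) data where the noise grows with the response, producing the classic fan pattern:

```r
# hypothetical data: the noise is proportional to the size of the response
set.seed(2)
sim <- tibble(
  x = runif(200, 1, 10),
  y = 2 * x + rnorm(200, sd = 0.5 * x)
)

fan_aug <- augment(lm(y ~ x, data = sim))

# the vertical spread of the residuals grows from left to right
gf_point(.resid ~ .fitted, data = fan_aug) |>
  gf_hline(yintercept = 0, linetype = "dashed") |>
  gf_labs(x = "Fitted value", y = "Residual")
```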
Violations of Constant Variance
Impact: inference relies on estimates of \(\sigma_\epsilon\) computed from residuals:
Residuals will be larger in certain places so estimates will be inaccurate
Therefore, inference (i.e. CIs and p-values) will be inaccurate
Remedy: transform your data (to come)
Complete Exercise 4
Normality
The linear model assumes that the distribution of \(Y\) is Normal for every value of \(X\)
This is impossible to check in practice, so we will look at the overall distribution of the residuals to assess if the normality assumption is satisfied
A histogram of the residuals should look approximately normal, symmetric, without any huge outliers
A normal QQ-plot falls along a diagonal line
Most inferential methods for regression are robust to some departures from normality, so we can proceed with inference if the sample size is sufficiently large, roughly \(n > 30\) depending on how non-normal your residuals look
Notable exception: prediction intervals!
Check normality using a histogram
Check normality using a QQ-plot
```r
gf_qq(~ .resid, data = spotify_aug) |>
  gf_qqline() |>
  gf_labs(
    x = "Theoretical quantile",
    y = "Observed quantile",
    title = "Normal QQ-plot of residuals"
  )
```
\(x\)-axis: quantile we would expect from a true normal distribution
\(y\)-axis: quantile we observe in the data
Bell-shaped does not necessarily mean normal… a QQ-plot can detect distributions with heavier (i.e. more spread out) tails than a normal distribution
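A minimal sketch with simulated (hypothetical) data illustrating this: \(t\)-distributed values look bell-shaped in a histogram, but the QQ-plot reveals their heavy tails.

```r
# hypothetical data: t-distributed values with 3 df have heavy tails
set.seed(3)
heavy <- tibble(e = rt(500, df = 3))

# the histogram looks roughly bell-shaped...
gf_histogram(~ e, data = heavy, bins = 30, color = "white")

# ...but the QQ-plot peels away from the line at both ends
gf_qq(~ e, data = heavy) |>
  gf_qqline()
```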
Check normality using a QQ-plot
```r
gf_histogram(~ .resid, data = spotify_aug, bins = 7, color = "white") |>
  gf_labs(
    x = "Residual",
    y = "Count",
    title = "Histogram of residuals"
  )
```
```r
gf_qq(~ .resid, data = spotify_aug) |>
  gf_qqline() |>
  gf_labs(
    x = "Theoretical quantile",
    y = "Observed quantile",
    title = "Normal QQ-plot of residuals"
  )
```
Assess whether residuals lie along the diagonal line of the Quantile-quantile plot (QQ-plot).
If so, the residuals are normally distributed.
Note: QQ-plots are quite sensitive, so it doesn’t take much departure from the line to conclude non-normality
Normality
❌ The residuals do not appear to follow a normal distribution, because the points do not lie on the diagonal line, so normality is not satisfied.
✅ The sample size is \(n = 508 > 30\), which is large enough to relax this condition and proceed with inference (mostly).
Violations of Normality
Impact: depends on what you want to do and how large your sample size is
Your prediction intervals will be wrong… they will be symmetric when they shouldn’t be…
If you have a large sample size, it’s not a big deal for anything else
Remedy: depends on what you want to do
If your sample size is large enough and you don’t care about prediction intervals… do nothing
Otherwise, transform your data… (hard for small sample sizes)
Complete Exercises 5 and 6.
Independence
We can often check the independence assumption based on the context of the data and how the observations were collected
Three common violations of the independence assumption:
Temporal correlation: If the data were collected over time, plot the residuals in time order to see if there is a pattern (serial correlation); see the sketch below
Cluster effect: If there are subgroups represented in the data that are not accounted for in the model, color the points in the residual plots by group to see if the model systematically over- or under-predicts for a particular subgroup
Spatial correlation: If observations that are close to one another are more correlated than ones far apart, then independence is violated
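A minimal sketch of the temporal check, assuming the rows of the data are in collection (time) order, which may not actually hold for this dataset:

```r
# add the row order as a stand-in for collection time
spotify_aug <- spotify_aug |>
  mutate(order = row_number())

# look for drifts or waves; random scatter is consistent with independence
gf_point(.resid ~ order, data = spotify_aug) |>
  gf_hline(yintercept = 0, linetype = "dashed") |>
  gf_labs(x = "Observation order", y = "Residual")
```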
Independence
Complete Exercise 7
❌ Based on the information we have, it’s unlikely the data are independent.
Violations of Independence
Impact: depends on how it’s violated
In some cases you’ll underestimate p-values (too many false positives) and make CIs which are too narrow
In other cases you’ll do the opposite
Remedy… depends on the source and type of dependence
Square and sum: \(\sum_{i=1}^n (y_i-\bar{y})^2 = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n (y_i-\hat{y}_i)^2\)
\(\substack{\text{Sum of squares} \\ \text{total}} = \substack{\text{Sum of squares} \\ \text{model}} + \substack{\text{Sum of squares} \\ \text{error}}\)
\(SSTotal = SSModel + SSE\)
\(SST = SSM + SSE\)
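A minimal sketch verifying the decomposition numerically from the augmented data, assuming spotify_fit is the danceability-on-duration model used throughout:

```r
# sums of squares computed directly from the augmented data
sst <- sum((spotify_aug$danceability - mean(spotify_aug$danceability))^2)
ssm <- sum((spotify_aug$.fitted - mean(spotify_aug$danceability))^2)
sse <- sum(spotify_aug$.resid^2)

c(SSM = ssm, SSE = sse, SST = sst)
all.equal(ssm + sse, sst) # TRUE: the decomposition holds
```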
ANOVA in R
```r
spotify_fit |>
  anova() |>
  tidy() |>
  kable()
```
| term         |  df | sumsq     | meansq    | statistic | p.value   |
|--------------|----:|-----------|-----------|-----------|-----------|
| duration_min |   1 | 0.1685294 | 0.1685294 | 9.928516  | 0.0017237 |
| Residuals    | 506 | 8.5889829 | 0.0169743 | NA        | NA        |
More on this later in the semester
Complete Exercise 8.
Recall: Correlation Coefficient
The correlation coefficient, \(r\), is a number between -1 and +1 that measures how strong the linear relationship between two variables \(x\) and \(y\) is.
R-squared, \(R^2\), Coefficient of Determination: Percentage of variability in the outcome explained by the regression model (in the context of SLR, the predictor) \[
R^2 = Cor(y, \hat{y})^2
\]
Also called PRE (Percent Reduction in Error) because: \[
R^2 = \frac{SSModel}{SSTotal}
\]
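A minimal sketch checking that the two formulas agree (reusing ssm and sst from the decomposition sketch above):

```r
r2_cor <- cor(spotify_aug$danceability, spotify_aug$.fitted)^2
r2_pre <- ssm / sst # percent reduction in error

c(r2_cor, r2_pre) # both ≈ 0.019, matching rsq() below
```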
Two statistics: RMSE
Root mean square error, RMSE: A measure of the average error (average difference between observed and predicted values of the outcome) \[
RMSE = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}}
\]
Sometimes people just care about the numerator (SSE) or the version without the square root (MSE)
Sometimes the denominator may have \(n-1\) instead
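A minimal sketch computing RMSE straight from the definition (with \(n\) in the denominator):

```r
# RMSE by hand from the residuals
sqrt(sum(spotify_aug$.resid^2) / nrow(spotify_aug)) # ≈ 0.130, matches rmse() below
```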
What indicates a good model fit? Higher or lower \(R^2\)? Higher or lower RMSE?
\(R^2\)
Ranges between 0 (terrible predictor) and 1 (perfect predictor)
Has no units
Calculate with rsq() from yardstick package using the augmented data:
```r
library(yardstick)

spotify_aug <- augment(spotify_fit)

rsq(spotify_aug, truth = danceability, estimate = .fitted) |>
  kable()
```
| .metric | .estimator | .estimate |
|---------|------------|-----------|
| rsq     | standard   | 0.019244  |
Interpreting \(R^2\)
🗳️ Discussion
The \(R^2\) of the model for danceability from duration_min is 1.9%. Which of the following is the correct interpretation of this value?
duration_min correctly predicts 1.9% of danceability.
1.9% of the variability in danceability can be explained by duration_min.
1.9% of the variability in duration_min can be explained by danceability.
1.9% of the time danceability can be predicted by duration_min.
Complete Exercise 9.
Activity
In groups, at the board, design a simulation-based procedure for producing a p-value for the following hypothesis test.
\(H_0: R^2 = 0\)
\(H_A: R^2 \neq 0\)
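One possible approach (a sketch of a permutation test, not the only valid design): shuffle the response so the null hypothesis of no relationship holds by construction, then compare the observed \(R^2\) to the shuffled ones.

```r
set.seed(4)
r2_obs <- summary(spotify_fit)$r.squared

# shuffling danceability breaks any real association with duration_min,
# so these R^2 values approximate the null distribution
r2_null <- replicate(1000, {
  fit_shuffled <- lm(sample(danceability) ~ duration_min, data = spotify)
  summary(fit_shuffled)$r.squared
})

# R^2 cannot be negative, so "at least as extreme" is the upper tail
mean(r2_null >= r2_obs)
```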
RMSE
Ranges between 0 (perfect predictor) and infinity (terrible predictor)
Same units as the response variable
Interpretation (kind of): how much my model misses by, on average
Calculate with rmse() from yardstick package using the augmented data:
```r
rmse(spotify_aug, truth = danceability, estimate = .fitted)
```

```
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.130
```
Complete Exercise 10.
Using the word “Good”
There is no such thing as a “Good” \(R^2\) or, especially, RMSE without context
Whether your model is a “Good” model depends on many things:
What are you using your model for?
How good are other models?
Recap
Used residual plots to check conditions for SLR:
Linearity (residuals vs. fitted values)
Constant variance (residuals vs. fitted values)
Normality (histogram/QQ-plot of residuals)
Independence (knowledge of data collection)
Note: Predictions are still valid as long as linearity is met, but p-values and CIs are not valid unless the other three conditions hold
Can decompose total variation (SST) into variation explained by the model (SSM) and leftover variation (SSE)
Two metrics for evaluating and comparing models:
\(R^2\): What proportion of the variation in the response variable is explained by the model?
\(RMSE\): By how much does my model miss, on average?