library(tidyverse)
library(ggformula)
library(broom)
library(knitr)
library(patchwork) #arrange plots in a grid
AE 05: Model Conditions and Evaluation
Songs on Spotify
Introduction
This is a continuation of AE-04. The Data section below is the same as in that exercise. Feel free to skip it if you feel you remember everything about the data set or simply use it as a reference when needed.
Data
The data set for this assignment is a subset from the Spotify Songs Tidy Tuesday data set. The data were originally obtained from Spotify using the spotifyr R package.
It contains numerous characteristics for each song. You can see the full list of variables and definitions here. This analysis will focus specifically on the following variables:
variable | class | description |
---|---|---|
track_id | character | Song unique ID |
track_name | character | Song Name |
track_artist | character | Song Artist |
track_popularity | double | Song Popularity (0-100) where higher is better |
danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db. |
valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
tempo | double | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
duration_ms | double | Duration of song in milliseconds |
<- read_csv("../data/spotify-popular.csv") spotify
What makes a song danceable? To answer this question, we’ll analyze data on some of the most popular songs on Spotify, i.e. those with track_popularity >= 80
. We’ll use linear regression to fit some models to predict a song’s dancebility
. Each group will be assigned a different predictor variable.
Predictor Assignment:
Below is your assigned predictor. Look at the table above for a definition.
- Group 1:
energy
- Group 2:
loudness
- Group 3:
valence
- Group 4:
tempo
Exercise 0
Fit a model using your assigned explanatory variable to predict a songs danceability
. Use tidy
and kable
to neatly display your model and have your reporter write you model on the white board along with a confidence interval and p-value for the slope. Be prepared to discuss the interpretation of these quantities.
## add code
Exercise 1
Use the augment
function from the broom
package to create an augmented data frame from the model you fit in Exercise 0.
Exercise 2
Make a plot of the residual vs. fitted values.
Exercise 3
To what extent does the linearity condition appear to be met for your data?
Exercise 4
To what extent does the constant variance assumption appear to be met for your data?
Exercise 5
Fill in the code to make a histogram of the residuals and a normal QQ-plot.
<- gf_histogram(~____, data = ____)) |>
resid_hist gf_labs(x = "_____",
y = "_____",
title = "____")
<- gf_qq(~____, data = ____) |>
resid_qq gf_qqline() |>
gf_labs(x = "_____",
y = "_____",
title = "____")
+ resid_qq resid_hist
Exercise 6
To what extent does the normality condition of your data appear to be met?
Exercise 7
Our data was collected from Spotify… to what extent do you think the independence condition is met? Do you think there is a serial effect? Do you think there is a cluster effect?
Exercise 8
Using the anova
function, compute the SSE, SSM, and SST from your model. Note that the function will give you two of those values and you’ll have to use subtraction to compute the other two.
Exercise 9
Load the yardstick
package and compute the \(R^2\) of your model. Then verify that you get the same number when you divide your SSM by your SST. How do you interpret this value? Have your reporter write the resulting value on the board.
Exericse 10
Compute the RMSE for your model. Interpret this value.
Exercise 11
Which group has the best model?
To submit the AE:
- Render the document to produce the HTML file with all of your work from today’s class.
- Upload your QMD and HTML files to the Canvas assignment.