AE 05: Model Conditions and Evaluation

Songs on Spotify

Important

Open RStudio and create a subfolder in your AE folder called “AE-05”
Go to the Canvas and locate your AE 05 assignment to get started.
Upload the ae-05.qmd and spotify-popular.csv files into the folder you just created.

library(tidyverse)
library(ggformula)
library(broom)
library(knitr)
library(patchwork) #arrange plots in a grid

Introduction

This is a continuation of AE-04. The Data section below is the same as in that exercise. Feel free to skip it if you feel you remember everything about the data set or simply use it as a reference when needed.

Data

The data set for this assignment is a subset from the Spotify Songs Tidy Tuesday data set. The data were originally obtained from Spotify using the spotifyr R package.

It contains numerous characteristics for each song. You can see the full list of variables and definitions here. This analysis will focus specifically on the following variables:

variable	class	description
track_id	character	Song unique ID
track_name	character	Song Name
track_artist	character	Song Artist
track_popularity	double	Song Popularity (0-100) where higher is better
danceability	double	Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy	double	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
loudness	double	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
valence	double	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo	double	The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms	double	Duration of song in milliseconds

spotify <- read_csv("../data/spotify-popular.csv")

What makes a song danceable? To answer this question, we’ll analyze data on some of the most popular songs on Spotify, i.e. those with track_popularity >= 80. We’ll use linear regression to fit some models to predict a song’s dancebility. Each group will be assigned a different predictor variable.

Predictor Assignment:

Below is your assigned predictor. Look at the table above for a definition.

Group 1: energy
Group 2: loudness
Group 3: valence
Group 4: tempo

Exercise 0

Fit a model using your assigned explanatory variable to predict a songs danceability. Use tidy and kable to neatly display your model and have your reporter write you model on the white board along with a confidence interval and p-value for the slope. Be prepared to discuss the interpretation of these quantities.

## add code

Exercise 1

Use the augment function from the broom package to create an augmented data frame from the model you fit in Exercise 0.

Exercise 2

Make a plot of the residual vs. fitted values.

Exercise 3

To what extent does the linearity condition appear to be met for your data?

Exercise 4

To what extent does the constant variance assumption appear to be met for your data?

Exercise 5

Fill in the code to make a histogram of the residuals and a normal QQ-plot.

resid_hist <- gf_histogram(~____, data = ____)) |> 
  gf_labs(x = "_____", 
       y = "_____", 
       title = "____")

resid_qq <- gf_qq(~____, data = ____)  |>
  gf_qqline() |> 
  gf_labs(x = "_____", 
       y = "_____", 
       title = "____")

resid_hist + resid_qq

Exercise 6

To what extent does the normality condition of your data appear to be met?

Exercise 7

Our data was collected from Spotify… to what extent do you think the independence condition is met? Do you think there is a serial effect? Do you think there is a cluster effect?

Exercise 8

Using the anova function, compute the SSE, SSM, and SST from your model. Note that the function will give you two of those values and you’ll have to use subtraction to compute the other two.

Exercise 9

Load the yardstick package and compute the \(R^2\) of your model. Then verify that you get the same number when you divide your SSM by your SST. How do you interpret this value? Have your reporter write the resulting value on the board.

Exericse 10

Compute the RMSE for your model. Interpret this value.

Exercise 11

Which group has the best model?

Important

To submit the AE:

Render the document to produce the HTML file with all of your work from today’s class.
Upload your QMD and HTML files to the Canvas assignment.