Multiple linear regression (MLR)

Prof. Eric Friedlander

Application Exercise

📋 AE 07 - Multiple Linear Regression and Categorical Predictors

Complete Exercise 0-3.

Computational setup

# load packages
library(tidyverse)
library(broom)
library(mosaic)
library(ISLR2)
library(patchwork)
library(knitr)
library(coursekata)
library(kableExtra)
library(scales)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))

# Create new variable

Credit <- Credit |> 
  mutate(Has_Balance = factor(ifelse(Balance == 0, "No", "Yes")))

Considering multiple variables

Data: Credit Cards

The data is from the Credit data set in the ISLR2 R package. It is a simulated data set of 400 credit card customers.

Rows: 400
Columns: 12
$ Income      <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20.996,…
$ Limit       <dbl> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6819…
$ Rating      <dbl> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, 138…
$ Cards       <dbl> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3, 1, 2…
$ Age         <dbl> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49, 75…
$ Education   <dbl> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9, 13, 1…
$ Own         <fct> No, Yes, No, Yes, No, No, Yes, No, Yes, Yes, No, No, Yes, …
$ Student     <fct> No, Yes, No, No, No, No, No, No, No, Yes, No, No, No, No, …
$ Married     <fct> Yes, Yes, No, No, Yes, No, No, No, No, Yes, Yes, No, Yes, …
$ Region      <fct> South, West, West, West, South, South, East, West, South, …
$ Balance     <dbl> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1407, …
$ Has_Balance <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No,…

Variables

Features (another name for predictors)

Cards: Number of credit cards
Rating: Credit Rating
Has_Balance: Whether they have a credit card balance

Outcome

Limit: Credit limit

Outcome: `Limit`

Code

Credit |> 
  gf_density(~Limit, fill = "steelblue") |> 
  gf_labs(title = "Distribution of credit limit",
          x = "Credit Limit")|> 
  gf_refine(scale_x_continuous(labels = dollar_format()))

	min	Q1	median	Q3	max	mean	sd	n	missing
	855	3088	4622.5	5872.75	13913	4735.6	2308.199	400	0

Predictors

Code

p1 <- Credit |> 
  gf_density(~Limit, fill = "steelblue") |> 
  gf_labs(title = "Distribution of credit limit",
          x = "Credit Limit")|> 
    gf_refine(scale_x_continuous(labels = dollar_format()))

p2 <- Credit |> 
  gf_histogram(~Rating, binwidth = 50) |> 
  gf_labs(title = "",
       x = "Credit Rating")

p3 <- Credit |> 
  gf_histogram(~Cards, binwidth = 1) |> 
  gf_labs(title = "",
       x = "Number of Credit Cards")

p4 <- Credit |> 
  gf_bar(~Has_Balance)|> 
  gf_labs(title = "",
       x = "Has a Credit Card Balance")

(p1 + p2) / (p3 + p4)

Outcome vs. predictors

Code

library(GGally)

Credit |> 
  select(Limit, Rating, Cards, Has_Balance) |> 
  ggpairs()

Single vs. multiple predictors

So far we’ve used a single predictor variable to understand variation in a quantitative response variable

Now we want to use multiple predictor variables to understand variation in a quantitative response variable

Multiple linear regression

Multiple linear regression (MLR)

Based on the analysis goals, we will use a multiple linear regression model of the following form

\[ \begin{aligned}\hat{\text{Limit}} ~ = \hat{\beta}_0 & + \hat{\beta}_1 \text{Rating} + \hat{\beta}_2 \text{Cards} \end{aligned} \]

Similar to simple linear regression, this model assumes that at each combination of the predictor variables, the values of Limit follow a Normal distribution.

Multiple linear regression

Recall: The simple linear regression model assumes

\[ Y|X\sim N(\beta_0 + \beta_1 X, \sigma_{\epsilon}^2) \]

Similarly: The multiple linear regression model assumes

\[ Y|X_1, X_2, \ldots, X_p \sim N(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p, \sigma_{\epsilon}^2) \]

Multiple linear regression

At any combination of the predictors, the mean value of the response $Y$, is

\[ \mu_{Y|X_1, \ldots, X_p} = \beta_0 + \beta_1 X_{1} + \beta_2 X_2 + \dots + \beta_p X_p \]

Using multiple linear regression, we can estimate the mean response for any combination of predictors

\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_{1} + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_p X_{p} \]

Model fit

term	estimate	std.error	statistic
(Intercept)	-341.15903	24.7670758	-13.77470
Rating	14.90573	0.0496795	300.03798
Cards	-72.31808	5.6054513	-12.90138

Model equation

\[ \begin{align}\hat{\text{Limit}} = -341.159 &+14.906 \times \text{Rating}\\ & -72.318 \times \text{Cards} \end{align} \]

Interpreting $\hat{\beta}_j$

The estimated coefficient $\hat{\beta}_j$ is the expected change in the mean of $y$ when $x_j$ increases by one unit, holding the values of all other predictor variables constant.

Complete Exercises 4-6.

Prediction

What is the predicted credit limit for a borrower with these characteristics?

head(Credit, n = 1) |> kable()

Income	Limit	Rating	Cards	Age	Education	Own	Student	Married	Region	Balance	Has_Balance
14.891	3606	283	2	34	11	No	No	Yes	South	333	Yes

-341.159 +14.906 * 283 -72.318 * 2

[1] 3732.603

The predicted credit limit for an borrower with an credit rating of 700 and who has 2 credit cards is $3733.

Prediction, revisited

Just like with simple linear regression, we can use the predict function in R to calculate the appropriate intervals for our predicted values:

       fit      lwr      upr
1 3732.526 3430.472 4034.581

Note

Difference in predicted value due to rounding the coefficients on the previous slide.

Complete Exercise 7.

Prediction interval for $\hat{y}$

Calculate a 90% confidence interval for the predicted credit limit for an individual borrower an credit rating of 700, and who has 2 credit cards.

predict(lim_fit, new_borrower, interval = "prediction", level = 0.90)

       fit      lwr      upr
1 3732.526 3479.216 3985.837

When would you use "confidence"? Would the interval be wider or narrower?

Cautions

Do not extrapolate! Because there are multiple predictor variables, there is the potential to extrapolate in many directions
The multiple regression model only shows association, not causality
- To show causality, you must have a carefully designed experiment or carefully account for confounding variables in an observational study

Categorical Predictors

Indicator variables

Suppose there is a categorical variable with $K$ categories (levels)
We can make $K$ indicator variables - one indicator for each category
An indicator variable takes values 1 or 0
- 1 if the observation belongs to that category
- 0 if the observation does not belong to that category

Data manipulation: Create indicator variables for `Has_Balance`

Credit_dummy <- Credit |>
  mutate(
    No_Balance = if_else(Has_Balance == "No", 1, 0),
    Yes_Balance = if_else(Has_Balance == "Yes", 1, 0)
  )

Credit_dummy |>
  select(Has_Balance, No_Balance, Yes_Balance) |>
  slice(1, 12)

  Has_Balance No_Balance Yes_Balance
1         Yes          0           1
2          No          1           0

Indicators in the model

We will use $K-1$ of the indicator variables in the model.
The reference level or baseline is the category that doesn’t have a term in the model.
The coefficients of the indicator variables in the model are interpreted as the expected change in the response compared to the baseline, holding all other variables constant.
This approach is also called dummy coding and R will do this for you

Credit_dummy |>
  select(Has_Balance, No_Balance, Yes_Balance) |>
  slice(1, 12)

  Has_Balance No_Balance Yes_Balance
1         Yes          0           1
2          No          1           0

Interpreting Categorical Predictors

term	estimate	std.error	statistic	p.value
(Intercept)	2152.722	194.211	11.084	0
Has_BalanceYes	3332.746	220.609	15.107	0

Where do we see each of the estimates in the plot?
Where do we see the values we’d predict in the plot?
Are Has_Balance and Limit correlated?

03:00

Complete Exercises 8-10.

Model equation

\[ \begin{align}\hat{\text{Limit}} = 2152.722 &+ 3332.746 \times \text{Yes_Balance} \end{align} \]

Adding in another predictor

Complete Exercise 11.

Credit Limit vs. Cards: parallel slopes

Code

cards_model <- lm(Limit ~ Cards, data = Credit)

cards_hasbal_model <- lm(Limit ~ Cards + Has_Balance, data = Credit)

p1 <- plotModel(cards_model) |>
  gf_labs(title = "SLR")
p2 <- plotModel(cards_hasbal_model) |>
  gf_labs(title = "Balance Indicator")

p1 + p2

Parallel slopes interpretation

term	estimate	std.error	statistic	p.value
(Intercept)	2257.044	272.007	8.298	0.000
Cards	-36.964	67.419	-0.548	0.584
Has_BalanceYes	3339.198	221.117	15.102	0.000

Slope of Cards is -36.964 regardless of Has_Balance level
Change in Has_Balance corresponds to a shift in the intercept
- Intercept for No is 2257.044
- For Yes shift intercept up 3339.198
  - (i.e. intercept $= 2257.044 + 3339.198 = 5596.242$)
Complete Exercise 12

Interaction terms

Sometimes the relationship between a predictor variable and the response depends on the value of another predictor variable.
This is an interaction effect.
To account for this, we can include interaction terms in the model.
We want a model of the form:

\[ \begin{aligned}\hat{Y} ~ = \hat{\beta}_0 & + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3X_1\times X_2 \end{aligned} \]

Interaction terms: Quantitative and Categorical

\[ \begin{aligned}\hat{\text{Limit}} ~ = \hat{\beta}_0 & + \hat{\beta}_1 \text{Cards} + \hat{\beta}_2 \text{Has_Balance} + \hat{\beta}_3\text{Cards}\times\text{Has_Balance} \end{aligned} \]

Interest rate vs. cards: interaction term

Bike Rentals vs. Temperature: interaction term

Interpreting interaction terms

term	estimate	std.error	statistic	p.value
(Intercept)	2324.664	515.510	4.509	0.000
Cards	-60.924	169.143	-0.360	0.719
Has_BalanceYes	3257.976	570.455	5.711	0.000
Cards:Has_BalanceYes	28.499	184.470	0.154	0.877

Slope of Rating depends on Has_Balance level
Equivalent: fit two separate linear models on the data corresponding to each level of Has_Balance

Understanding the model

\[ \begin{aligned} \hat{Limit} &= 2324.664 - 60.924 \times Cards \\ &\qquad+ 3257.976 \times Has\_Balance\\ &\qquad+ 28.499 \times Cards \times Has\_Balance \end{aligned} \]

Interpreting the interaction term

For a borrower in no balance, the slope of Cards is $-60.924$
For a borrower with a balance, the slope of Cards is $(-60.924 + 28.499=-32.425)$

Complete Exercise 13 and 14.

Interaction terms: Two Quantitative

\[ \begin{aligned}\hat{\text{Limit}} ~ = \hat{\beta}_0 & + \hat{\beta}_1 \text{Cards} + \hat{\beta}_2 \text{Rating} + \hat{\beta}_3\text{Cards}\times\text{Rating} \end{aligned} \]

Interpreting interaction terms

What the interaction means: The effect of the number of open credit cards on the credit limit depends on the borrowers credit rating and vice versa

Visualizing Model: No Interaction

Visualizing Interaction Model: Exaggerated

Visualizing Interaction Model: Real

Model Fit

term	estimate	std.error	statistic	p.value
(Intercept)	-315.00522	46.72441	-6.74177	0.00000
Cards	-81.11569	14.45641	-5.61105	0.00000
Rating	14.83433	0.11902	124.64042	0.00000
Cards:Rating	0.02376	0.03598	0.66030	0.50945

\[ \begin{aligned}\hat{\text{Limit}} ~ = & -315.005 + 14.834~\text{Rating} -81.116~\text{Cards}\\ & \qquad+ 0.024~\text{Rating}\times\text{Cards} \end{aligned} \]

Interpreting the interaction term

For a fixed Rating the slope of Cards is $(-81.116 + 0.024\times\text{Rating})$
For a fixed Cards the slope of Rating is $(14.834 + 0.024\times\text{Cards})$

What’s actually happening:

Credit_int <- Credit |>
  mutate(Interaction = Cards * Rating)

Credit_int |>
  select(Limit, Cards, Rating, Interaction) |>
  head() |>
  kable()

Limit	Cards	Rating	Interaction
3606	2	283	566
6645	3	483	1449
7075	4	514	2056
9504	3	681	2043
4897	2	357	714
8047	4	569	2276

What’s actually happening:

lm(Limit ~ Cards + Rating + Interaction, data = Credit_int) |>
  tidy() |>
  kable(digits = 5)

term	estimate	std.error	statistic	p.value
(Intercept)	-315.00522	46.72441	-6.74177	0.00000
Cards	-81.11569	14.45641	-5.61105	0.00000
Rating	14.83433	0.11902	124.64042	0.00000
Interaction	0.02376	0.03598	0.66030	0.50945

Wrap up

Recap

Introduced multiple linear regression
Interpreted coefficients in the multiple linear regression model
Calculated predictions and associated intervals for multiple linear regression models
Introduced categorical variables
Used interaction terms

Multiple linear regression (MLR)

Application Exercise

Computational setup

Considering multiple variables

Data: Credit Cards

Variables

Outcome: Limit

Predictors

Outcome vs. predictors

Single vs. multiple predictors

Multiple linear regression

Multiple linear regression (MLR)

Multiple linear regression

Multiple linear regression

Model fit

Model equation

Interpreting \(\hat{\beta}_j\)

Prediction

Prediction, revisited

Prediction interval for \(\hat{y}\)

Cautions

Categorical Predictors

Indicator variables

Data manipulation: Create indicator variables for Has_Balance

Indicators in the model

Interpreting Categorical Predictors

Model equation

Adding in another predictor

Credit Limit vs. Cards: parallel slopes

Parallel slopes interpretation

Interaction terms

Interaction terms

Interaction terms: Quantitative and Categorical

Interest rate vs. cards: interaction term

Bike Rentals vs. Temperature: interaction term

Interpreting interaction terms

Understanding the model

Interpreting the interaction term

Interaction terms: Two Quantitative

Interpreting interaction terms

Visualizing Model: No Interaction

Visualizing Interaction Model: Exaggerated

Visualizing Interaction Model: Real

Model Fit

Interpreting the interaction term

What’s actually happening:

What’s actually happening:

Wrap up

Recap

Outcome: `Limit`

Data manipulation: Create indicator variables for `Has_Balance`