Logistic regression

Introduction

Prof. Eric Friedlander

Application Exercise

📋 AE 09 - Intro to Logistic Regression

  • Complete Exercises 0-2.

Logistic regression

Topics

  • Introduction to modeling categorical data

  • Logistic regression for binary response variable

  • Relationship between odds and probabilities

Computational setup

# load packages
library(tidyverse)
library(ggformula)
library(broom)
library(knitr)
library(ggforce)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Predicting categorical outcomes

Types of outcome variables

Quantitative outcome variable:

  • Sales price of a house
  • Model: Expected sales price given the number of bedrooms, lot size, etc.

Categorical outcome variable:

  • Indicator for developing coronary heart disease in the next 10 years
  • Model: Probability an adult is high risk of heart disease in the next 10 years given their age, total cholesterol, etc.

Models for categorical outcomes

Logistic regression

2 Outcomes

  • 1: “Success” (the model gives the probability of this category)
  • 0: “Failure”

Multinomial logistic regression

3+ Outcomes

  • 1: Democrat
  • 2: Republican
  • 3: Independent

2024 election forecasts

The Economist

2020 NBA finals predictions

Source: FiveThirtyEight 2019-20 NBA Predictions

Data: Framingham Study

This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to use a patient's age to predict whether a randomly selected adult is at high risk of heart disease in the next 10 years.

# keep the variables of interest and drop rows with missing values
heart_disease <- read_csv("../data/framingham.csv") |>
  select(totChol, TenYearCHD, age, BMI, cigsPerDay, heartRate) |>
  drop_na()

Variables

  • Response:
    • TenYearCHD:
      • 1: Patient developed heart disease within 10 years of exam
      • 0: Patient did not develop heart disease within 10 years of exam
  • Predictor:
    • age: age in years at time of visit

Plot the data

Let’s fit a linear regression model

🛑 This model produces predictions outside of 0 and 1.

Let’s try another model

Let’s try another model: Zooming Out

✅ This model (called a logistic regression model) only produces predictions between 0 and 1.

The code

# scatterplot of the 0/1 response vs. age, with a fitted logistic curve
heart_disease |> 
  gf_point(TenYearCHD ~ age) |>
  gf_hline(yintercept = c(0, 1), lty = 2) |>   # predictions must stay between these lines
  gf_labs(y = "CHD Risk", x = "Age") |> 
  gf_refine(stat_smooth(method = "glm", method.args = list(family = "binomial"), 
                        fullrange = TRUE, se = FALSE))

Different types of models

Method                           Outcome       Model
Linear regression                Quantitative  \(Y = \beta_0 + \beta_1~X\)
Linear regression (transform Y)  Quantitative  \(\log(Y) = \beta_0 + \beta_1~X\)
Logistic regression              Binary        \(\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~X\)

Note: In this class (and in most college-level math classes, and in R), \(\log\) means log base \(e\) (i.e., the natural log)
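
A quick check in the R console confirms this:

log(exp(1))  # 1, so log() is the natural log (base e)
log10(100)   # 2; use log10() if you want base 10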

Linear vs. logistic regression

Complete Exercise 3.

Linear vs. logistic regression

State whether a linear regression model or logistic regression model is more appropriate for each scenario.

  1. Use age and education to predict if a randomly selected person will vote in the next election.

  2. Use budget and run time (in minutes) to predict a movie’s total revenue.

  3. Use age and sex to calculate the probability a randomly selected adult will visit St. Lukes in the next year.

Odds and probabilities

Binary response variable

  • \(Y\):
    • 1: “success” (not necessarily a good thing)
    • 0: “failure”
  • \(\pi\): probability that \(Y=1\), i.e., \(P(Y = 1)\)
  • \(\frac{\pi}{1-\pi}\): odds that \(Y = 1\)
  • \(\log\big(\frac{\pi}{1-\pi}\big)\): log-odds
  • Go from \(\pi\) to \(\log\big(\frac{\pi}{1-\pi}\big)\) using the logit transformation

Odds

Suppose there is a 70% chance it will rain tomorrow

  • Probability it will rain is \(\mathbf{p = 0.7}\)
  • Probability it won’t rain is \(\mathbf{1 - p = 0.3}\)
  • Odds it will rain are 7 to 3, 7:3, \(\mathbf{\frac{0.7}{0.3} \approx 2.33}\)
    • For every 3 times it doesn’t rain, it will rain 7 times
    • For every time it doesn’t rain, it will rain 2.33 times
  • Log-odds it will rain are \(\mathbf{\log\big(\frac{0.7}{0.3}\big) \approx \log(2.33) \approx 0.847}\)
    • Negative \(\Rightarrow\) probability of success less than 50-50 (0.5)
    • Positive \(\Rightarrow\) probability of success greater than 50-50 (0.5)
    • What are the log-odds of a probability of 0? What about 1?
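
As a quick sketch, the rain example in R:

p <- 0.7          # probability it rains
1 - p             # 0.3: probability it doesn't rain
p / (1 - p)       # 2.33: odds it rains
log(p / (1 - p))  # 0.847: log-odds it rains
# try p <- 0 and p <- 1 to answer the last question above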

From log-odds to probabilities

log-odds

\[\omega = \log \frac{\pi}{1-\pi}\]

odds

\[e^\omega = \frac{\pi}{1-\pi}\]

probability

\[\pi = \frac{e^\omega}{1 + e^\omega}\]
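
As a sketch of these conversions in R (plogis() is base R's inverse-logit, \(e^\omega/(1+e^\omega)\)):

p <- 0.7
omega <- log(p / (1 - p))      # log-odds
exp(omega)                     # odds: p / (1 - p) = 2.33
exp(omega) / (1 + exp(omega))  # recovers the probability, 0.7
plogis(omega)                  # same conversion using base R's inverse-logit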

Complete Exercises 4-5.

Logistic regression

From odds to probabilities

  1. Logistic model: log-odds = \(\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~X\)
  2. Odds = \(\exp\big\{\log\big(\frac{\pi}{1-\pi}\big)\big\} = \frac{\pi}{1-\pi}\)
  3. Combining (1) and (2) with what we saw earlier

\[\text{probability} = \pi = \frac{\exp\{\beta_0 + \beta_1~X\}}{1 + \exp\{\beta_0 + \beta_1~X\}}\]
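
The algebra for step 3, solving the odds equation for \(\pi\):

\[\frac{\pi}{1-\pi} = \exp\{\beta_0 + \beta_1~X\} \;\Rightarrow\; \pi = (1-\pi)\exp\{\beta_0 + \beta_1~X\} \;\Rightarrow\; \pi\big(1 + \exp\{\beta_0 + \beta_1~X\}\big) = \exp\{\beta_0 + \beta_1~X\}\]

Dividing both sides by \(1 + \exp\{\beta_0 + \beta_1~X\}\) gives the probability form above.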

Logistic regression model

Logit form: \[\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~X\]

Probability form:

\[ \pi = \frac{\exp\{\beta_0 + \beta_1~X\}}{1 + \exp\{\beta_0 + \beta_1~X\}} \]

Variables

  • Response:
    • TenYearCHD:
      • 1: Patient developed heart disease within 10 years of exam
      • 0: Patient did not develop heart disease within 10 years of exam
  • Predictors:
    • age: age in years

Logistic regression

Logistic regression model

Logit form: \[\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~X\]

Probability form:

\[ \pi = \frac{\exp\{\beta_0 + \beta_1~X\}}{1 + \exp\{\beta_0 + \beta_1~X\}} \]

Today: Using R to fit this model.

TenYearCHD vs. age

heart_disease |> 
  gf_sina(age ~ factor(TenYearCHD)) |> 
  gf_labs(x = "TenYearCHD - 1: yes, 0: no",
          y = "Age", 
          title = "Age vs. TenYearCHD")

TenYearCHD vs. age

heart_disease |> 
  gf_violin(age ~ factor(TenYearCHD), fill = "steelblue") |> 
  gf_labs(x = "TenYearCHD - 1: yes, 0: no",
          y = "Age", 
          title = "Age vs. TenYearCHD")

TenYearCHD vs. age

heart_disease |> 
  gf_boxplot(age ~ factor(TenYearCHD), fill = "steelblue") |> 
  gf_sina(size = 0.75, alpha = 0.25) |> 
  gf_labs(x = "TenYearCHD - 1: yes, 0: no",
          y = "Age", 
          title = "Age vs. TenYearCHD")

Let’s fit a model

# fit the logistic regression model with glm(), using family = "binomial"
heart_disease_fit <- glm(TenYearCHD ~ age, data = heart_disease, family = "binomial")

tidy(heart_disease_fit) |> kable()
term         estimate    std.error  statistic  p.value
(Intercept)  -5.6614125  0.2899446  -19.52584        0
age           0.0763254  0.0053760   14.19754        0

The model

tidy(heart_disease_fit) |> kable(digits = 3)
term         estimate  std.error  statistic  p.value
(Intercept)    -5.661      0.290    -19.526        0
age             0.076      0.005     14.198        0

\[\textbf{Logit form:}\qquad\log\Big(\frac{\hat{\pi}}{1-\hat{\pi}}\Big) = -5.661 + 0.076 \times \text{age}\]

\[\textbf{Probability form:}\qquad\hat{\pi} = \frac{\exp(-5.661 + 0.076 \times \text{age})}{1+\exp(-5.661 + 0.076 \times \text{age})}\]

where \(\hat{\pi}\) is the predicted probability of developing heart disease in the next 10 years.

Interpreting \(\hat{\beta}\)’s

tidy(heart_disease_fit) |> kable(digits = 3)
term         estimate  std.error  statistic  p.value
(Intercept)    -5.661      0.290    -19.526        0
age             0.076      0.005     14.198        0

For every additional year of age, the log-odds of developing heart disease in the next 10 years increases by 0.076.

Complete Exercises 6-8.

Interpretability of \(\beta\) for predicted probabilities

  • SLOPE IS CHANGING!
  • Increase in \(\hat{\pi}\) from one additional year of age depends on the starting age (see the sketch below)
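
A minimal sketch in R, using the fitted model (ages 40, 41, 60, and 61 are arbitrary illustrative values):

ages <- data.frame(age = c(40, 41, 60, 61))  # arbitrary illustrative ages
p_hat <- predict(heart_disease_fit, newdata = ages, type = "response")
p_hat[2] - p_hat[1]  # change in pi-hat from age 40 to 41: about 0.005
p_hat[4] - p_hat[3]  # change from age 60 to 61 is larger, about 0.015,
                     # even though the log-odds change by 0.076 both times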

glm and augment

The .fitted values in augment correspond to predictions from the logit form of the model (i.e., the log-odds):

augment(heart_disease_fit)  |> head()
# A tibble: 6 × 8
  TenYearCHD   age .fitted .resid     .hat .sigma   .cooksd .std.resid
       <dbl> <dbl>   <dbl>  <dbl>    <dbl>  <dbl>     <dbl>      <dbl>
1          0    39   -2.68 -0.363 0.000472  0.891 0.0000161     -0.363
2          0    46   -2.15 -0.469 0.000330  0.891 0.0000192     -0.469
3          0    48   -2.00 -0.504 0.000295  0.891 0.0000200     -0.504
4          1    61   -1.01  1.62  0.000730  0.891 0.000999       1.62 
5          0    46   -2.15 -0.469 0.000330  0.891 0.0000192     -0.469
6          0    43   -2.38 -0.421 0.000393  0.891 0.0000182     -0.421

Note: The residuals do not make sense here the way they do in linear regression!

For observation 1

\[\text{predicted probability} = \hat{\pi} = \frac{\exp\{-2.680\}}{1 + \exp\{-2.680\}} = 0.064\]
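
We can verify this arithmetic in R:

exp(-2.680) / (1 + exp(-2.680))  # 0.064, computed by hand
plogis(-2.680)                   # same value via base R's inverse-logit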

Using predict with glm

Default output is log-odds:

predict(heart_disease_fit, newdata = heart_disease) |> head() |> kable(digits = 3)
x
-2.685
-2.150
-1.998
-1.006
-2.150
-2.379

Using predict with glm

More commonly you want the predicted probability:

predict(heart_disease_fit, newdata = heart_disease, type = "response") |> head() |> kable(digits = 3)
x
0.064
0.104
0.119
0.268
0.104
0.085
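
The same function predicts for adults who aren't in the data. As a sketch, a hypothetical 50-year-old (the age here is an invented example value):

new_adult <- data.frame(age = 50)  # hypothetical new patient
predict(heart_disease_fit, newdata = new_adult)                     # log-odds: about -1.85
predict(heart_disease_fit, newdata = new_adult, type = "response")  # probability: about 0.14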

Complete Exercise 9.

Recap

  • Introduced logistic regression for binary response variable
  • Described relationship between odds and probabilities
  • Fit logistic regression models using glm
  • Interpreted coefficients in logistic regression models
  • Used logistic regression model to calculate predicted odds and probabilities
  • Used predict to make predictions from glm models