Logistic regression

Introduction

Prof. Eric Friedlander

Application Exercise

📋 AE 09 - Intro to Logistic Regression

  • Complete Exercises 0-2.

Logistic regression

Topics

  • Introduction to modeling categorical data

  • Logistic regression for binary response variable

  • Relationship between odds and probabilities

Computational setup

# load packages
library(tidyverse)
library(ggformula)
library(broom)
library(knitr)
library(ggforce)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 20))

Predicting categorical outcomes

Types of outcome variables

Quantitative outcome variable:

  • Sales price of a house
  • Model: Expected sales price given the number of bedrooms, lot size, etc.

Categorical outcome variable:

  • Indicator for developing coronary heart disease in the next 10 years
  • Model: Probability an adult is high risk of heart disease in the next 10 years given their age, total cholesterol, etc.

Models for categorical outcomes

Logistic regression

2 Outcomes

  • 1: “Success” (the model gives the probability of this category)
  • 0: “Failure”

Multinomial logistic regression

3+ Outcomes

  • 1: Democrat
  • 2: Republican
  • 3: Independent

2024 election forecasts

The Economist

2020 NBA finals predictions

Source: FiveThirtyEight 2019-20 NBA Predictions

Data: Framingham Study

This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to use a patient's age to predict whether a randomly selected adult is at high risk of heart disease in the next 10 years.

# keep the variables of interest and drop rows with missing values
heart_disease <- read_csv("../data/framingham.csv") |>
  select(totChol, TenYearCHD, age, BMI, cigsPerDay, heartRate) |>
  drop_na()

Variables

  • Response:
    • TenYearCHD:
      • 1: Patient developed heart disease within 10 years of exam
      • 0: Patient did not develop heart disease within 10 years of exam
  • Predictor:
    • age: age in years at time of visit

Plot the data

Let’s fit a linear regression model

🛑 This model produces predictions outside of 0 and 1.

Let’s try another model

Let’s try another model: Zooming Out

✅ This model (called a logistic regression model) only produces predictions between 0 and 1.

The code

# scatterplot of the 0/1 response vs. age, with a fitted logistic curve
heart_disease |> 
  gf_point(TenYearCHD ~ age) |>
  gf_hline(yintercept = c(0, 1), lty = 2) |>   # predictions must stay between these lines
  gf_labs(y = "CHD Risk", x = "Age") |> 
  gf_refine(stat_smooth(method = "glm", method.args = list(family = "binomial"), 
                        fullrange = TRUE, se = FALSE))

Different types of models

Method                           Outcome       Model
Linear regression                Quantitative  \(Y = \beta_0 + \beta_1~X\)
Linear regression (transform Y)  Quantitative  \(\log(Y) = \beta_0 + \beta_1~X\)
Logistic regression              Binary        \(\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~X\)

Note: In this class (and in most college-level math classes, and in R), \(\log\) means log base \(e\) (i.e., the natural log)
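
A quick check in the R console confirms this:

log(exp(1))  # 1, so log() is the natural log (base e)
log10(100)   # 2; use log10() if you want base 10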

Linear vs. logistic regression

Complete Exercise 3.

Linear vs. logistic regression

State whether a linear regression model or logistic regression model is more appropriate for each scenario.

  1. Use age and education to predict if a randomly selected person will vote in the next election.

  2. Use budget and run time (in minutes) to predict a movie’s total revenue.

  3. Use age and sex to calculate the probability a randomly selected adult will visit St. Lukes in the next year.

Odds and probabilities

Binary response variable

  • \(Y\):
    • 1: “success” (not necessarily a good thing)
    • 0: “failure”
  • \(\pi\): probability that \(Y=1\), i.e., \(P(Y = 1)\)
  • \(\frac{\pi}{1-\pi}\): odds that \(Y = 1\)
  • \(\log\big(\frac{\pi}{1-\pi}\big)\): log-odds
  • Go from \(\pi\) to \(\log\big(\frac{\pi}{1-\pi}\big)\) using the logit transformation

Odds

Suppose there is a 70% chance it will rain tomorrow

  • Probability it will rain is \(\mathbf{p = 0.7}\)
  • Probability it won’t rain is \(\mathbf{1 - p = 0.3}\)
  • Odds it will rain are 7 to 3, 7:3, \(\mathbf{\frac{0.7}{0.3} \approx 2.33}\)
    • For every 3 times it doesn’t rain, it will rain 7 times
    • For every time it doesn’t rain, it will rain 2.33 times
  • Log-odds it will rain are \(\mathbf{\log\big(\frac{0.7}{0.3}\big) \approx \log(2.33) \approx 0.847}\)
    • Negative \(\Rightarrow\) probability of success less than 50-50 (0.5)
    • Positive \(\Rightarrow\) probability of success greater than 50-50 (0.5)
    • What are the log-odds of a probability of 0? What about 1?
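
As a quick sketch, the rain example in R:

p <- 0.7          # probability it rains
1 - p             # 0.3: probability it doesn't rain
p / (1 - p)       # 2.33: odds it rains
log(p / (1 - p))  # 0.847: log-odds it rains
# try p <- 0 and p <- 1 to answer the last question above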

From log-odds to probabilities

log-odds

\[\omega = \log \frac{\pi}{1-\pi}\]

odds

\[e^\omega = \frac{\pi}{1-\pi}\]

probability

\[\pi = \frac{e^\omega}{1 + e^\omega}\]
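
As a sketch of these conversions in R (plogis() is base R's inverse-logit, \(e^\omega/(1+e^\omega)\)):

p <- 0.7
omega <- log(p / (1 - p))      # log-odds
exp(omega)                     # odds: p / (1 - p) = 2.33
exp(omega) / (1 + exp(omega))  # recovers the probability, 0.7
plogis(omega)                  # same conversion using base R's inverse-logit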

Complete Exercises 4-5.

Logistic regression

From odds to probabilities

  1. Logistic model: log-odds = \(\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~X\)
  2. Odds = \(\exp\big\{\log\big(\frac{\pi}{1-\pi}\big)\big\} = \frac{\pi}{1-\pi}\)
  3. Combining (1) and (2) with what we saw earlier

\[\text{probability} = \pi = \frac{\exp\{\beta_0 + \beta_1~X\}}{1 + \exp\{\beta_0 + \beta_1~X\}}\]
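
The algebra for step 3, solving the odds equation for \(\pi\):

\[\frac{\pi}{1-\pi} = \exp\{\beta_0 + \beta_1~X\} \;\Rightarrow\; \pi = (1-\pi)\exp\{\beta_0 + \beta_1~X\} \;\Rightarrow\; \pi\big(1 + \exp\{\beta_0 + \beta_1~X\}\big) = \exp\{\beta_0 + \beta_1~X\}\]

Dividing both sides by \(1 + \exp\{\beta_0 + \beta_1~X\}\) gives the probability form above.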

Logistic regression model

Logit form: \[\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~X\]

Probability form:

\[ \pi = \frac{\exp\{\beta_0 + \beta_1~X\}}{1 + \exp\{\beta_0 + \beta_1~X\}} \]

Variables

  • Response:
    • TenYearCHD:
      • 1: Patient developed heart disease within 10 years of exam
      • 0: Patient did not develop heart disease within 10 years of exam
  • Predictors:
    • age: age in years

Logistic regression

Logistic regression model

Logit form: \[\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~X\]

Probability form:

\[ \pi = \frac{\exp\{\beta_0 + \beta_1~X\}}{1 + \exp\{\beta_0 + \beta_1~X\}} \]

Today: Using R to fit this model.

TenYearCHD vs. age

heart_disease |> 
  gf_sina(age ~ factor(TenYearCHD)) |> 
  gf_labs(x = "TenYearCHD - 1: yes, 0: no",
          y = "Age", 
          title = "Age vs. TenYearCHD")

TenYearCHD vs. age

heart_disease |> 
  gf_violin(age ~ factor(TenYearCHD), fill = "steelblue") |> 
  gf_labs(x = "TenYearCHD - 1: yes, 0: no",
          y = "Age", 
          title = "Age vs. TenYearCHD")

TenYearCHD vs. age

heart_disease |> 
  gf_boxplot(age ~ factor(TenYearCHD), fill = "steelblue") |> 
  gf_sina(size = 0.75, alpha = 0.25) |> 
  gf_labs(x = "TenYearCHD - 1: yes, 0: no",
          y = "Age", 
          title = "Age vs. TenYearCHD")

Let’s fit a model

# fit the logistic regression model with glm(), using family = "binomial"
heart_disease_fit <- glm(TenYearCHD ~ age, data = heart_disease, family = "binomial")

tidy(heart_disease_fit) |> kable()
term         estimate    std.error  statistic  p.value
(Intercept)  -5.6614125  0.2899446  -19.52584        0
age           0.0763254  0.0053760   14.19754        0

The model

tidy(heart_disease_fit) |> kable(digits = 3)
term         estimate  std.error  statistic  p.value
(Intercept)    -5.661      0.290    -19.526        0
age             0.076      0.005     14.198        0

\[\textbf{Logit form:}\qquad\log\Big(\frac{\hat{\pi}}{1-\hat{\pi}}\Big) = -5.661 + 0.076 \times \text{age}\]

\[\textbf{Probability form:}\qquad\hat{\pi} = \frac{\exp(-5.661 + 0.076 \times \text{age})}{1+\exp(-5.661 + 0.076 \times \text{age})}\]

where \(\hat{\pi}\) is the predicted probability of developing heart disease in the next 10 years.

Interpreting \(\hat{\beta}\)’s

tidy(heart_disease_fit) |> kable(digits = 3)
term         estimate  std.error  statistic  p.value
(Intercept)    -5.661      0.290    -19.526        0
age             0.076      0.005     14.198        0

For every additional year of age, the log-odds of developing heart disease in the next 10 years increases by 0.076.

Complete Exercises 6-8.

Interpretability of \(\beta\) for predicted probabilities

  • SLOPE IS CHANGING!
  • Increase in \(\hat{\pi}\) from one additional year of age depends on the starting age (see the sketch below)
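
A minimal sketch in R, using the fitted model (ages 40, 41, 60, and 61 are arbitrary illustrative values):

ages <- data.frame(age = c(40, 41, 60, 61))  # arbitrary illustrative ages
p_hat <- predict(heart_disease_fit, newdata = ages, type = "response")
p_hat[2] - p_hat[1]  # change in pi-hat from age 40 to 41: about 0.005
p_hat[4] - p_hat[3]  # change from age 60 to 61 is larger, about 0.015,
                     # even though the log-odds change by 0.076 both times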

glm and augment

The .fitted values in augment correspond to predictions from the logit form of the model (i.e., the log-odds):

augment(heart_disease_fit)  |> head()
# A tibble: 6 × 8
  TenYearCHD   age .fitted .resid     .hat .sigma   .cooksd .std.resid
       <dbl> <dbl>   <dbl>  <dbl>    <dbl>  <dbl>     <dbl>      <dbl>
1          0    39   -2.68 -0.363 0.000472  0.891 0.0000161     -0.363
2          0    46   -2.15 -0.469 0.000330  0.891 0.0000192     -0.469
3          0    48   -2.00 -0.504 0.000295  0.891 0.0000200     -0.504
4          1    61   -1.01  1.62  0.000730  0.891 0.000999       1.62 
5          0    46   -2.15 -0.469 0.000330  0.891 0.0000192     -0.469
6          0    43   -2.38 -0.421 0.000393  0.891 0.0000182     -0.421

Note: The residuals do not make sense here the way they do in linear regression!

For observation 1

\[\text{predicted probability} = \hat{\pi} = \frac{\exp\{-2.680\}}{1 + \exp\{-2.680\}} = 0.064\]
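
We can verify this arithmetic in R:

exp(-2.680) / (1 + exp(-2.680))  # 0.064, computed by hand
plogis(-2.680)                   # same value via base R's inverse-logit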

Using predict with glm

Default output is log-odds:

predict(heart_disease_fit, newdata = heart_disease) |> head() |> kable(digits = 3)
x
-2.685
-2.150
-1.998
-1.006
-2.150
-2.379

Using predict with glm

More commonly you want the predicted probability:

predict(heart_disease_fit, newdata = heart_disease, type = "response") |> head() |> kable(digits = 3)
x
0.064
0.104
0.119
0.268
0.104
0.085
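
The same function predicts for adults who aren't in the data. As a sketch, a hypothetical 50-year-old (the age here is an invented example value):

new_adult <- data.frame(age = 50)  # hypothetical new patient
predict(heart_disease_fit, newdata = new_adult)                     # log-odds: about -1.85
predict(heart_disease_fit, newdata = new_adult, type = "response")  # probability: about 0.14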

Complete Exercise 9.

Recap

  • Introduced logistic regression for binary response variable
  • Described relationship between odds and probabilities
  • Fit logistic regression models using glm
  • Interpreted coefficients in logistic regression models
  • Used logistic regression model to calculate predicted odds and probabilities
  • Used predict to make predictions from glm models