Introduction
📋 AE 09 - Intro to Logistic Regression
Introduction to modeling categorical data
Logistic regression for binary response variable
Relationship between odds and probabilities
Quantitative outcome variable: linear regression

Categorical outcome variable:

- 2 outcomes: logistic regression
- 3+ outcomes: multinomial logistic regression
This data set is from an ongoing cardiovascular study of residents of the town of Framingham, Massachusetts. We want to use a patient's age to predict whether a randomly selected adult is at high risk for heart disease in the next 10 years.
TenYearCHD
: whether the patient is at high risk of heart disease in the next 10 years (1 = yes, 0 = no)

age
: age in years at time of visit

🛑 This model produces predictions outside of 0 and 1.
✅ This model (called a logistic regression model) only produces predictions between 0 and 1.
Method | Outcome | Model |
---|---|---|
Linear regression | Quantitative | \(Y = \beta_0 + \beta_1~ X\) |
Linear regression (transform Y) | Quantitative | \(\log(Y) = \beta_0 + \beta_1~ X\) |
Logistic regression | Binary | \(\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1 ~ X\) |
Note: In this class (and in most college-level math classes, and in R), \(\log\) means log base \(e\) (i.e., the natural log).
Complete Exercise 3.
State whether a linear regression model or logistic regression model is more appropriate for each scenario.
Use age and education to predict if a randomly selected person will vote in the next election.
Use budget and run time (in minutes) to predict a movie’s total revenue.
Use age and sex to calculate the probability a randomly selected adult will visit St. Luke's in the next year.
Suppose there is a 70% chance it will rain tomorrow.
log-odds
\[\omega = \log \frac{\pi}{1-\pi}\]
odds
\[e^\omega = \frac{\pi}{1-\pi}\]
probability
\[\pi = \frac{e^\omega}{1 + e^\omega}\]
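These three forms can be checked numerically. A quick sketch in R for the 70%-rain example above:

```r
p <- 0.7                       # probability of rain tomorrow
odds <- p / (1 - p)            # odds: 0.7 / 0.3 = 7/3, about 2.33
omega <- log(odds)             # log-odds (natural log), about 0.847

# converting back: probability from log-odds
exp(omega) / (1 + exp(omega))  # 0.7
```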
Complete Exercises 4-5.
\[\text{probability} = \pi = \frac{\exp\{\beta_0 + \beta_1~X\}}{1 + \exp\{\beta_0 + \beta_1~X\}}\]
Logit form: \[\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~X\]
Probability form:
\[ \pi = \frac{\exp\{\beta_0 + \beta_1~X\}}{1 + \exp\{\beta_0 + \beta_1~X\}} \]
TenYearCHD
: whether the patient is at high risk of heart disease in the next 10 years (1 = yes, 0 = no)

age
: age in years

Logit form: \[\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~X\]
Probability form:
\[ \pi = \frac{\exp\{\beta_0 + \beta_1~X\}}{1 + \exp\{\beta_0 + \beta_1~X\}} \]
Today: Using R to fit this model.
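A sketch of the fitting code. The data frame name `heart_disease` matches the `predict()` call later in these slides; the data here are simulated, since the Framingham data set itself is not reproduced in this handout:

```r
library(broom)  # tidy() turns model output into a tidy coefficient table

# simulated stand-in for the Framingham data: TenYearCHD is 0/1, age in years
set.seed(1)
heart_disease <- data.frame(age = round(runif(500, 32, 70)))
heart_disease$TenYearCHD <- rbinom(500, 1,
                                   plogis(-5.661 + 0.076 * heart_disease$age))

# family = "binomial" is what makes glm() fit a logistic regression
heart_disease_fit <- glm(TenYearCHD ~ age, data = heart_disease,
                         family = "binomial")

tidy(heart_disease_fit)  # term, estimate, std.error, statistic, p.value
```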
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -5.661 | 0.290 | -19.526 | 0 |
age | 0.076 | 0.005 | 14.198 | 0 |
\[\textbf{Logit form:}\qquad\log\Big(\frac{\hat{\pi}}{1-\hat{\pi}}\Big) = -5.661 + 0.076 \times \text{age}\]
\[\textbf{Probability form:}\qquad\hat{\pi} = \frac{\exp(-5.661 + 0.076 \times \text{age})}{1+\exp(-5.661 + 0.076 \times \text{age})}\]
where \(\hat{\pi}\) is the predicted probability of developing heart disease in the next 10 years.
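To turn the fitted equation into a single prediction, plug an age into the probability form. For example, for a (hypothetical, chosen just for illustration) 60-year-old:

```r
b0 <- -5.661
b1 <- 0.076

log_odds <- b0 + b1 * 60               # logit form: -1.101
exp(log_odds) / (1 + exp(log_odds))    # probability form: about 0.25

plogis(log_odds)                       # plogis() is R's built-in inverse logit
```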
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -5.661 | 0.290 | -19.526 | 0 |
age | 0.076 | 0.005 | 14.198 | 0 |
For every additional year of age, the log-odds of developing heart disease in the next 10 years increases by 0.076.
Complete Exercises 6-8.
`glm` and `augment`

The `.fitted` values in `augment()` correspond to predictions from the logit form of the model (i.e., the log-odds):
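The output below comes from a call along these lines (assuming the fitted model object is named `heart_disease_fit`, as in the `predict()` slides that follow):

```r
library(broom)

# one row per observation; .fitted is on the log-odds scale
augment(heart_disease_fit) |> head()
```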
# A tibble: 6 × 8
TenYearCHD age .fitted .resid .hat .sigma .cooksd .std.resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 39 -2.68 -0.363 0.000472 0.891 0.0000161 -0.363
2 0 46 -2.15 -0.469 0.000330 0.891 0.0000192 -0.469
3 0 48 -2.00 -0.504 0.000295 0.891 0.0000200 -0.504
4 1 61 -1.01 1.62 0.000730 0.891 0.000999 1.62
5 0 46 -2.15 -0.469 0.000330 0.891 0.0000192 -0.469
6 0 43 -2.38 -0.421 0.000393 0.891 0.0000182 -0.421
Note: The residuals do not make sense here!
For observation 1:
\[\text{predicted probability} = \hat{\pi} = \frac{\exp\{-2.680\}}{1 + \exp\{-2.680\}} = 0.064\]
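This back-conversion is a one-liner with `plogis()`:

```r
log_odds <- -2.680                     # .fitted value for observation 1
exp(log_odds) / (1 + exp(log_odds))    # 0.064
plogis(log_odds)                       # identical result
```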
`predict` with `glm`

Default output is log-odds:
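A sketch of the call (with no `type` argument, `predict()` on a binomial `glm` returns predictions on the link scale, i.e. the log-odds; `heart_disease_fit` is the fitted model from earlier):

```r
predict(heart_disease_fit) |> head()   # log-odds, same as augment's .fitted
```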
`predict` with `glm`
More commonly you want the predicted probability:
predict(heart_disease_fit, newdata = heart_disease, type = "response") |> head() |> kable(digits = 3)
x |
---|
0.064 |
0.104 |
0.119 |
0.268 |
0.104 |
0.085 |
Complete Exercise 9.
- `glm` to fit a logistic regression model
- `predict` to make predictions using `glm`