MLR: Inference

Prof. Eric Friedlander

Application Exercise

📋 AE 15 - Inference for MLR

  • Complete Exercises 0-1

Topics

  • Inference for multiple linear regression
  • Effect Size and Power

Computational setup

# load packages
library(tidyverse)
library(broom)
library(mosaic)
library(Stat2Data)
library(patchwork)
library(knitr)
library(kableExtra)
library(scales)
library(GGally)
library(countdown)
library(rms)
library(coursekata)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))

Data: Tadpoles

Biologists wondered whether tadpoles can adjust the relative length of their intestines if they are exposed to a fungus called Batrachochytrium dendrobatidis (Bd).

Rows: 27
Columns: 4
$ Treatment       <fct> Bd, Bd, Bd, Bd, Bd, Bd, Bd, Bd, Bd, Bd, Bd, Bd, Bd, Bd…
$ Body            <dbl> 20.28, 19.75, 19.28, 17.81, 20.79, 16.99, 18.12, 16.95…
$ GutLength       <dbl> 191.40, 142.92, 169.81, 152.18, 171.33, 153.92, 181.08…
$ MouthpartDamage <dbl> 0.679, 0.607, 0.750, 0.821, 0.696, 0.571, 0.571, 0.571…

Source: Stat2Data.

Conduct a hypothesis test for \(\beta_j\)

Review: Simple linear regression (SLR)

p1 <- gf_jitter(GutLength ~ Treatment, data = Tadpoles, alpha = 0.5, height = 0, width = 0.1) |>
  gf_model(lm(GutLength ~ Treatment, data = Tadpoles), color = "red") |>
  gf_labs(x = "Treatment", y = "Gut Length (mm)")
p2 <- gf_point(GutLength ~ MouthpartDamage, data = Tadpoles, alpha = 0.5) |>
  gf_model(lm(GutLength ~ MouthpartDamage, data = Tadpoles), color = "red") |>
  gf_labs(x = "Mouthpart Damage", y = "Gut Length (mm)")
p1 + p2

SLR model summary

treatment_model <- lm(GutLength ~ Treatment, data = Tadpoles)
damage_model <- lm(GutLength ~ MouthpartDamage, data = Tadpoles)
tidy(treatment_model) |> kable()
term             estimate   std.error statistic p.value
(Intercept)      168.34714   7.357157 22.882092 0.0000000
TreatmentControl  33.02824  10.602792  3.115051 0.0045722

tidy(damage_model) |> kable()
term            estimate  std.error statistic p.value
(Intercept)     137.57782  39.07066  3.521256 0.0016738
MouthpartDamage  70.85005  58.59173  1.209216 0.2378918

SLR hypothesis test

  1. Set hypotheses: \(H_0: \beta_1 = 0\) vs. \(H_A: \beta_1 \ne 0\)

  2. Calculate the test statistic and p-value . . .

  3. State the conclusion

Multiple linear regression

  • Discussion: How should we handle our confounding variables?

Complete Exercise 2.

Multiple linear regression

The multiple linear regression model assumes \[Y|X_1, X_2, \ldots, X_p \sim N(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p, \sigma_\epsilon^2)\]

For a given observation \((x_{i1}, x_{i2}, \ldots, x_{ip}, y_i)\), we can rewrite the previous statement as

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_{i}, \hspace{10mm} \epsilon_i \sim N(0,\sigma_{\epsilon}^2)\]
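The model statement above can be made concrete by simulating data from it and checking that `lm()` recovers the coefficients. A minimal sketch with made-up \(\beta\)'s (not the Tadpoles data):

```r
# Sketch: simulate from an MLR model with known coefficients.
# beta_0 = 2, beta_1 = 3, beta_2 = -1, sigma_epsilon = 0.5
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 + 3 * x1 - 1 * x2 + rnorm(n, sd = 0.5)
coef(lm(y ~ x1 + x2))  # estimates should be close to (2, 3, -1)
```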

Estimating \(\sigma_\epsilon\)

For a given observation \((x_{i1}, x_{i2}, \ldots,x_{ip}, y_i)\) the residual is \[ \begin{aligned} e_i &= y_{i} - \hat{y_i}\\ &= y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_{2} x_{i2} + \dots + \hat{\beta}_p x_{ip}) \end{aligned} \]

The estimated value of the regression standard error, \(\sigma_{\epsilon}\), is

\[\hat{\sigma}_\epsilon = \sqrt{\frac{\sum_{i=1}^ne_i^2}{n-p-1}}\]

As with SLR, we use \(\hat{\sigma}_{\epsilon}\) to calculate \(SE_{\hat{\beta}_j}\), the standard error of each coefficient. See Matrix Form of Linear Regression for more detail.
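As a quick check, the formula for \(\hat{\sigma}_\epsilon\) can be computed directly from the residuals of a fitted model and compared to the value `glance()` (from broom, loaded above) reports as `sigma`. A sketch using the Tadpoles MLR fit:

```r
# Sketch: compute the regression standard error by hand and compare
# it to the value lm() reports (glance()$sigma from broom).
mlr_fit <- lm(GutLength ~ Treatment + Body, data = Tadpoles)
n <- nrow(Tadpoles)  # 27 observations
p <- 2               # non-intercept coefficients: Treatment, Body
sqrt(sum(resid(mlr_fit)^2) / (n - p - 1))
glance(mlr_fit)$sigma  # same value
```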

MLR Model

lm(GutLength ~ Treatment + Body, data = Tadpoles) |>
  tidy()
# A tibble: 3 × 5
  term             estimate std.error statistic p.value
  <chr>               <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)         15.5      53.8      0.288 0.776  
2 TreatmentControl    16.3      11.0      1.48  0.153  
3 Body                 8.07      2.82     2.86  0.00859

MLR hypothesis test: treatment

  1. Set hypotheses: \(H_0: \beta_{trmt} = 0\) vs. \(H_A: \beta_{trmt} \ne 0\), given that Body is in the model
  2. Calculate test statistic and p-value: The test statistic is \(t = 1.48\). The p-value is calculated using a \(t\) distribution with \[(n - p - 1) = 27-2-1 = 24\] degrees of freedom. The p-value is \(\approx 0.153\).

    • Note that \(p\) counts all non-intercept \(\beta\)’s
  3. State the conclusion: The p-value provides no evidence in favor of the alternative, so we fail to reject \(H_0\). The data do not provide evidence that exposure to the Bd fungus is a useful predictor of gut length in a model that already contains body length (we could also say “when controlling for body length”).
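The p-value above can be reproduced by hand from the \(t\) distribution, a quick sketch:

```r
# Sketch: two-sided p-value for the Treatment coefficient by hand.
# t = 1.48 with n - p - 1 = 27 - 2 - 1 = 24 degrees of freedom.
2 * pt(abs(1.48), df = 24, lower.tail = FALSE)
# roughly 0.15 (the table's 0.153 uses the unrounded t statistic)
```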

MLR hypothesis test: interaction terms

  • Framework: Same as previous slide except use \(\beta\) of interaction term
  • \(p\) (the number of predictors) should include the interaction term
  • Conclusion: tells you whether interaction term is a useful predictor
  • Warning: if \(X_1\) and \(X_2\) have an interaction term, don’t try to interpret their individual p-values; if the interaction term is significant, then both variables are important
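A sketch of fitting an interaction model to these data; note that with the interaction term \(p = 3\), so the tests use \(27 - 3 - 1 = 23\) degrees of freedom:

```r
# Sketch: MLR with an interaction between Treatment and Body.
# Treatment * Body expands to Treatment + Body + Treatment:Body.
lm(GutLength ~ Treatment * Body, data = Tadpoles) |>
  tidy()
```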

Complete Exercise 3.

Effect Size and Power

Overview

  • Effect Size:
    • General idea: how big is your parameter compared to the variability in your data
    • Why it’s important: if a parameter has a small effect size, it may not be useful to include. However, what effect size is meaningful depends on your problem
    • In practice: there are many formal ways of determining effect size which we won’t discuss
  • Statistical Power: the ability of a hypothesis test to detect a given effect size
    • more data = more power
    • tons of data = tons of power
    • In practice: think about what effect size you’d like to be able to detect and design your study to collect enough data to do so

Effect Size

  • A few measures of effect size
    • Cohen’s \(d\) (for when you have two groups: e.g. treatment vs. control)
    • \(R^2\) (for when you have continuous predictors)
  • When evaluating effect size, consider practical significance, not just statistical significance
  • Effect size DOESN’T depend on sample size (other than sampling variability)
  • Statistical significance DOES depend on sample size
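Cohen’s \(d\) for the two treatment groups can be computed directly from group summaries. A minimal sketch (not using a dedicated effect-size package):

```r
# Sketch: Cohen's d = difference in group means / pooled SD.
grp <- Tadpoles |>
  group_by(Treatment) |>
  summarize(m = mean(GutLength), s = sd(GutLength), n = n())
pooled_sd <- sqrt(sum((grp$n - 1) * grp$s^2) / (sum(grp$n) - 2))
abs(diff(grp$m)) / pooled_sd
```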

Power

  • Power: probability of correctly rejecting null hypothesis given a certain effect size
  • Power depends on:
    • Effect size
    • Sample size
    • Significance level (\(\alpha\))
    • Variability in the data
  • In practice: decide on a minimum effect size you want to be able to detect, then design your study to have enough power to detect that effect size

Power Analysis using Simulation

lm(GutLength ~ Treatment + Body, data = Tadpoles) |>
  tidy()
# A tibble: 3 × 5
  term             estimate std.error statistic p.value
  <chr>               <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)         15.5      53.8      0.288 0.776  
2 TreatmentControl    16.3      11.0      1.48  0.153  
3 Body                 8.07      2.82     2.86  0.00859

Power Analysis using Simulation I

set.seed(2025)
sample_size <- 50
lm(GutLength ~ Treatment + Body, data = resample(Tadpoles, sample_size)) |>
  tidy()
# A tibble: 3 × 5
  term             estimate std.error statistic p.value
  <chr>               <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)         35.1      42.3      0.831 0.410  
2 TreatmentControl    16.9       8.04     2.10  0.0407 
3 Body                 7.24      2.18     3.32  0.00176

Power Analysis using Simulation II

sample_size <- 50
lm(GutLength ~ Treatment + Body, data = resample(Tadpoles, sample_size)) |>
  tidy()
# A tibble: 3 × 5
  term             estimate std.error statistic  p.value
  <chr>               <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)         11.1      34.2      0.326 0.746   
2 TreatmentControl    27.0       8.14     3.32  0.00174 
3 Body                 7.59      1.82     4.18  0.000125

Power Analysis using Simulation III

sample_size <- 50
do(10) * lm(GutLength ~ Treatment + Body, data = resample(Tadpoles, sample_size)) |>
  tidy()
# A tibble: 30 × 7
   term             estimate std.error statistic    p.value  .row .index
   <chr>               <dbl>     <dbl>     <dbl>      <dbl> <int>  <dbl>
 1 (Intercept)        24.6       35.1     0.700  0.487          1      1
 2 TreatmentControl   16.8        8.62    1.94   0.0580         2      1
 3 Body                7.59       1.85    4.10   0.000165       3      1
 4 (Intercept)       -15.8       34.4    -0.458  0.649          1      2
 5 TreatmentControl   -0.741      8.80   -0.0842 0.933          2      2
 6 Body               10.1        1.83    5.51   0.00000146     3      2
 7 (Intercept)        21.5       43.6     0.493  0.624          1      3
 8 TreatmentControl   30.1        8.29    3.63   0.000700       2      3
 9 Body                7.16       2.33    3.07   0.00357        3      3
10 (Intercept)        27.1       36.9     0.734  0.467          1      4
# ℹ 20 more rows

Power Analysis using Simulation IV

sample_size <- 50
do(10) * lm(GutLength ~ Treatment + Body, data = resample(Tadpoles, sample_size)) |>
  tidy() |>
  filter(term == "TreatmentControl")
# A tibble: 10 × 7
   term             estimate std.error statistic p.value  .row .index
   <chr>               <dbl>     <dbl>     <dbl>   <dbl> <int>  <dbl>
 1 TreatmentControl    10.5       8.31     1.26  0.213       1      1
 2 TreatmentControl     9.60      6.03     1.59  0.118       1      2
 3 TreatmentControl    11.6       8.20     1.41  0.164       1      3
 4 TreatmentControl    16.5       7.24     2.27  0.0276      1      4
 5 TreatmentControl    22.3       7.36     3.03  0.00399     1      5
 6 TreatmentControl     2.45      6.63     0.369 0.714       1      6
 7 TreatmentControl    17.2       8.38     2.06  0.0454      1      7
 8 TreatmentControl    20.8       8.31     2.51  0.0158      1      8
 9 TreatmentControl    24.1       7.98     3.02  0.00405     1      9
10 TreatmentControl    19.4       6.76     2.88  0.00602     1     10

Power Analysis using Simulation V

sample_size <- 50
samp_dist <- do(1000) * lm(GutLength ~ Treatment + Body, data = resample(Tadpoles, sample_size)) |>
  tidy() |>
  filter(term == "TreatmentControl")

tally(samp_dist$p.value < 0.05, format = "proportion")
X
 TRUE FALSE 
 0.56  0.44 

The power for a sample of size 50, assuming the same effect size as in our data, is approximately 0.56.
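To see how power grows with \(n\), the same simulation can be wrapped in a function and run at several sample sizes. A sketch (it assumes the effect size observed in our data, and `reps` is kept small for speed; larger values give a more stable estimate):

```r
# Sketch: estimated power at several sample sizes, assuming the
# effect size observed in the Tadpoles data.
power_at <- function(n, reps = 200) {
  sims <- do(reps) * lm(GutLength ~ Treatment + Body,
                        data = resample(Tadpoles, n)) |>
    tidy() |>
    filter(term == "TreatmentControl")
  mean(sims$p.value < 0.05)  # proportion of rejections at alpha = 0.05
}
sapply(c(27, 50, 100), power_at)
```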

Complete Exercise 4.

Confidence interval for \(\beta_j\)

Confidence interval for \(\beta_j\)

  • The \(C\%\) confidence interval for \(\beta_j\) \[\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)\] where \(t^*\) follows a \(t\) distribution with \(n - p - 1\) degrees of freedom.

  • Generically: We are \(C\%\) confident that the interval LB to UB contains the population coefficient of \(x_j\).

  • In context: We are \(C\%\) confident that for every one unit increase in \(x_j\), we expect \(y\) to change by LB to UB units, holding all else constant.
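In R, `confint()` (or `tidy()` with `conf.int = TRUE`) computes these intervals. A sketch with the Tadpoles model:

```r
# Sketch: 95% confidence intervals for each coefficient.
mlr_fit <- lm(GutLength ~ Treatment + Body, data = Tadpoles)
confint(mlr_fit, level = 0.95)
# or, in the tidyverse style used above:
tidy(mlr_fit, conf.int = TRUE, conf.level = 0.95)
```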

Complete Exercise 5.

Inference pitfalls

Large sample sizes

Caution

If the sample size is large enough, the test will likely result in rejecting \(H_0: \beta_j = 0\) even if \(x_j\) has a very small effect on \(y\).

  • Consider the practical significance of the result not just the statistical significance.

  • Instead of saying statistically significant, say statistically detectable

  • Use confidence intervals to draw conclusions instead of relying only on p-values.

Small sample sizes

Caution

If the sample size is small, there may not be enough evidence to reject \(H_0: \beta_j=0\).

  • When you fail to reject the null hypothesis, DON’T immediately conclude that the variable has no association with the response.

  • There may be a linear association that is just not strong enough to detect given your data, or there may be a non-linear association.

Recap

  • Inference for individual predictors
    • Hypothesis testing: same as SLR, but add “given all other terms are in the model”
    • Confidence intervals: same as SLR, but add “holding all other variables in the model constant”
  • Effect size and power
    • Effect size: how big is your parameter compared to variability of data
    • Power: ability to detect a given effect size