```
rail_trail
# A tibble: 90 × 7
   volume hightemp avgtemp season cloudcover precip day_type
    <dbl>    <dbl>   <dbl> <chr>       <dbl>  <dbl> <chr>
 1    501       83    66.5 Summer       7.60 0      Weekday
 2    419       73    61   Summer       6.30 0.290  Weekday
 3    397       74    63   Spring       7.5  0.320  Weekday
 4    385       95    78   Summer       2.60 0      Weekend
 5    200       44    48   Spring      10    0.140  Weekday
 6    375       69    61.5 Spring       6.60 0.0200 Weekday
 7    417       66    52.5 Spring       2.40 0      Weekday
 8    629       66    52   Spring       0    0      Weekend
 9    533       80    67.5 Summer       3.80 0      Weekend
10    547       79    62   Summer       4.10 0      Weekday
# ℹ 80 more rows
```

Source: Pioneer Valley Planning Commission via the mosaicData package.
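A sketch of how such a dataset could be assembled from the mosaicData package. The exact recoding is an assumption: rail_trail is presumed to be derived from mosaicData::RailTrail, with season and day_type constructed from that dataset's indicator columns.

```r
# Assumed setup: dplyr and mosaicData are installed
library(dplyr)
library(mosaicData)

# RailTrail is the raw dataset; this recoding into rail_trail is an assumption
rail_trail <- RailTrail %>%
  mutate(
    season = case_when(          # assumed construction of the season variable
      spring == 1 ~ "Spring",
      summer == 1 ~ "Summer",
      TRUE        ~ "Fall"
    ),
    day_type = if_else(weekday == 1, "Weekday", "Weekend")
  ) %>%
  select(volume, hightemp, avgtemp, season, cloudcover, precip, day_type)
```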
Full model, including both hightemp and avgtemp:

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 17.622161 | 76.582860 | 0.2301058 | 0.8185826 |
| hightemp | 7.070528 | 2.420523 | 2.9210743 | 0.0045045 |
| avgtemp | -2.036685 | 3.142113 | -0.6481896 | 0.5186733 |
| seasonSpring | 35.914983 | 32.992762 | 1.0885716 | 0.2795319 |
| seasonSummer | 24.153571 | 52.810486 | 0.4573632 | 0.6486195 |
| cloudcover | -7.251776 | 3.843071 | -1.8869743 | 0.0627025 |
| precip | -95.696525 | 42.573359 | -2.2478030 | 0.0272735 |
| day_typeWeekend | 35.903750 | 22.429056 | 1.6007696 | 0.1132738 |
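A table like the one above is the tidied output of a linear model. A minimal sketch of the fitting code, assuming the broom package is used for tidying:

```r
library(broom)

# Fit the full model with both temperature variables included
full_model <- lm(
  volume ~ hightemp + avgtemp + season + cloudcover + precip + day_type,
  data = rail_trail
)

# Coefficient table: term, estimate, std.error, statistic, p.value
tidy(full_model)
```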
Let’s assume the true population regression equation is \(y = 3 + 4x\)
Suppose we try estimating that equation using a model with variables \(x\) and \(z = x/10\)
\[ \begin{aligned}\hat{y}&= \hat{\beta}_0 + \hat{\beta}_1x + \hat{\beta}_2z\\ &= \hat{\beta}_0 + \hat{\beta}_1x + \hat{\beta}_2\frac{x}{10}\\ &= \hat{\beta}_0 + \bigg(\hat{\beta}_1 + \frac{\hat{\beta}_2}{10}\bigg)x \end{aligned} \]
\[\hat{y} = \hat{\beta}_0 + \bigg(\hat{\beta}_1 + \frac{\hat{\beta}_2}{10}\bigg)x\]
We can set \(\hat{\beta}_1\) and \(\hat{\beta}_2\) to any two numbers such that \(\hat{\beta}_1 + \frac{\hat{\beta}_2}{10} = 4\)
Therefore, we are unable to choose the “best” combination of \(\hat{\beta}_1\) and \(\hat{\beta}_2\)
In statistics, we say this model is “unidentifiable” because different parameter combinations can result in the same model
This is also why we need to set a reference level for categorical variables
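A small base-R illustration of this: with \(z = x/10\), the model matrix is rank-deficient, and lm() reports NA for the aliased coefficient rather than choosing among the infinitely many valid combinations. The simulated data here are an assumption for the demo.

```r
set.seed(1)                             # reproducible simulated data
x <- runif(100)
y <- 3 + 4 * x + rnorm(100, sd = 0.1)   # true model: y = 3 + 4x + noise
z <- x / 10                             # z is an exact linear function of x

fit <- lm(y ~ x + z)
coef(fit)   # the coefficient on z is NA: it is aliased with x
```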
Complete Exercises 1-2.
When we have perfect collinearity, we are unable to obtain estimates for all of the coefficients
When we have near-perfect collinearity (i.e., highly correlated predictor variables), the standard errors of our regression coefficients inflate
In other words, we lose precision in our estimates of the regression coefficients
This impedes our ability to use the model for inference
It is also difficult to interpret the model coefficients
Multicollinearity may occur when…
There are very high correlations \((r > 0.9)\) among two or more predictor variables, especially when the sample size is small
One or more predictor variables are an almost perfect linear combination of the others
There are interactions between two or more continuous variables
Variance Inflation Factor (VIF): Measure of multicollinearity in the regression model
\[VIF(\hat{\beta}_j) = \frac{1}{1-R^2_{X_j|X_{-j}}}\]
where \(R^2_{X_j|X_{-j}}\) is the proportion of variation in \(X_j\) that is explained by the linear combination of the other explanatory variables in the model.
Typically \(VIF > 10\) indicates concerning multicollinearity
Variables with similar values of VIF are typically the ones correlated with each other
Use the vif() function in the rms R package to calculate VIF
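A sketch of computing the VIFs for the full model, both with rms::vif() and by hand from the defining formula. The manual check assumes the full model from earlier; for a single predictor such as hightemp, the hand computation should match the reported VIF.

```r
library(rms)   # provides vif()

full_model <- lm(
  volume ~ hightemp + avgtemp + season + cloudcover + precip + day_type,
  data = rail_trail
)
vif(full_model)

# Manual check for one predictor: regress hightemp on the other predictors,
# then apply VIF = 1 / (1 - R^2)
r2 <- summary(
  lm(hightemp ~ avgtemp + season + cloudcover + precip + day_type,
     data = rail_trail)
)$r.squared
1 / (1 - r2)   # should match the VIF reported for hightemp
```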
Complete Exercise 5.
```
       hightemp         avgtemp    seasonSpring    seasonSummer      cloudcover
      10.259978       13.086175        2.751577        5.841985        1.587485
         precip day_typeWeekend
       1.295352        1.125741
```
hightemp and avgtemp both have VIFs above 10 (and similar values), indicating they are correlated with each other.
One option is to remove one of hightemp or avgtemp from the model. Another is to create a composite variable temp_composite that is the average of avgtemp and hightemp. Keeping avgtemp and hightemp together results in their individual \(\beta\)’s and p-values not having much meaning.

Complete Exercises 6 & 7.
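A sketch of creating the composite variable and fitting a model with it, assuming dplyr:

```r
library(dplyr)

# temp_composite is the average of the two temperature variables
rail_trail <- rail_trail %>%
  mutate(temp_composite = (hightemp + avgtemp) / 2)

composite_model <- lm(
  volume ~ temp_composite + season + cloudcover + precip + day_type,
  data = rail_trail
)
```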
Model without hightemp:

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 76.071 | 77.204 | 0.985 | 0.327 |
| avgtemp | 6.003 | 1.583 | 3.792 | 0.000 |
| seasonSpring | 34.555 | 34.454 | 1.003 | 0.319 |
| seasonSummer | 13.531 | 55.024 | 0.246 | 0.806 |
| cloudcover | -12.807 | 3.488 | -3.672 | 0.000 |
| precip | -110.736 | 44.137 | -2.509 | 0.014 |
| day_typeWeekend | 48.420 | 22.993 | 2.106 | 0.038 |
Model without avgtemp:

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 8.421 | 74.992 | 0.112 | 0.911 |
| hightemp | 5.696 | 1.164 | 4.895 | 0.000 |
| seasonSpring | 31.239 | 32.082 | 0.974 | 0.333 |
| seasonSummer | 9.424 | 47.504 | 0.198 | 0.843 |
| cloudcover | -8.353 | 3.435 | -2.431 | 0.017 |
| precip | -98.904 | 42.137 | -2.347 | 0.021 |
| day_typeWeekend | 37.062 | 22.280 | 1.663 | 0.100 |
Model with temp_composite:

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 18.823 | 77.430 | 0.243 | 0.809 |
| seasonSpring | 28.458 | 33.059 | 0.861 | 0.392 |
| seasonSummer | -0.986 | 51.234 | -0.019 | 0.985 |
| cloudcover | -10.367 | 3.409 | -3.041 | 0.003 |
| precip | -104.475 | 42.725 | -2.445 | 0.017 |
| day_typeWeekend | 40.914 | 22.479 | 1.820 | 0.072 |
| temp_composite | 6.292 | 1.376 | 4.571 | 0.000 |
| model | adj.r.squared | AIC | BIC |
|---|---|---|---|
| Without hightemp | 0.42 | 1087.50 | 1107.50 |
| Without avgtemp | 0.47 | 1079.05 | 1099.05 |
| With temp_composite | 0.46 | 1081.67 | 1101.67 |
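Fit statistics like those above can be collected with broom::glance(). A sketch, assuming the three candidate models have been fit under the (hypothetical) names no_hightemp_model, no_avgtemp_model, and composite_model:

```r
library(broom)
library(dplyr)

# Compare the three candidate models on adjusted R^2, AIC, and BIC
bind_rows(
  glance(no_hightemp_model),
  glance(no_avgtemp_model),
  glance(composite_model)
) %>%
  select(adj.r.squared, AIC, BIC)
```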
Based on Adjusted \(R^2\), AIC, and BIC, the model without avgtemp fits best among the three candidates. Therefore, we remove avgtemp and keep hightemp in the model to deal with the multicollinearity.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 8.421 | 74.992 | 0.112 | 0.911 |
| hightemp | 5.696 | 1.164 | 4.895 | 0.000 |
| seasonSpring | 31.239 | 32.082 | 0.974 | 0.333 |
| seasonSummer | 9.424 | 47.504 | 0.198 | 0.843 |
| cloudcover | -8.353 | 3.435 | -2.431 | 0.017 |
| precip | -98.904 | 42.137 | -2.347 | 0.021 |
| day_typeWeekend | 37.062 | 22.280 | 1.663 | 0.100 |