MATH 4780 / MSSC 5780 Regression Analysis
So far, we have assumed that the set of regressors in the model is known, and we have focused on model adequacy.
Model Selection
We now turn to choosing which candidate regressors to include in the model.
Two “conflicting” goals in model building:
as many regressors as possible for better predictive performance on new data (smaller bias).
as few regressors as possible because as the number of regressors increases, the variance of the predicted response \(\hat{y}\) increases, and the cost of collecting data and maintaining the model grows.
A compromise between the two hopefully leads to the “best” regression equation.
What does best mean?
There is no unique definition of “best”, and different methods specify different subsets of the candidate regressors as best.
The model may perform well on the current sample data but predict the response poorly on the new data we are actually interested in (overfitting).
Often, we don’t have another, unused data set to assess the performance of our model.
Solution: Pretend we have new data by splitting our data into a training set and a test set (validation set)!
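As a rough sketch of the idea (a hypothetical data frame dat with response y; these names are ours, not from the lecture code):
set.seed(4780)                                                # make the split reproducible
train_id  <- sample(nrow(dat), size = floor(0.8 * nrow(dat))) # 80% of rows for training
fit_train <- lm(y ~ ., data = dat[train_id, ])                # fit on the training set only
pred_test <- predict(fit_train, newdata = dat[-train_id, ])   # predict the held-out test set
mean((dat$y[-train_id] - pred_test)^2)                        # test-set mean squared error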
The full (largest) model has \(M\) candidate regressors.
There are \(M \choose p-1\) possible subset models with \(p\) coefficients, i.e., \(p - 1\) regressors plus the intercept.
In total, there are \(2^M\) possible subset models.
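For example, with \(M = 5\) candidate regressors, as in the data set analyzed later:
choose(5, 2)   # 10 subset models with p = 3 coefficients (2 regressors)
2^5            # 32 possible subset models in total, counting the intercept-only model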
An evaluation metric should consider Goodness of Fit and Model Complexity:
Goodness of Fit: the more regressors, the better the model fits the current data.
Complexity Penalty: the fewer regressors, the better.
For a model with \(p\) coefficients (\(k = p - 1\) predictors), Mallows' \(C_p\) statistic is
\[\begin{align} C_p &= \frac{SS_{res}(p)}{\hat{\sigma}^2} - n + 2p \\ &= p + \frac{(s^2 - \hat{\sigma}^2)(n-p)}{\hat{\sigma}^2} \end{align}\]
\(\hat{\sigma}^2\) is the variance estimate from the full model, i.e., \(\hat{\sigma}^2 = MS_{res}(M)\).
\(s^2\) is the variance estimate from the model with \(p\) coefficients, i.e., \(s^2 = MS_{res}(p)\).
We favor the candidate model with the smallest \(C_p\); a hand computation is sketched below.
For unbiased models, i.e., those with \(E[\hat{y}_i] = E[y_i]\), \(E(C_p) \approx p\).
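A minimal hand computation of \(C_p\), assuming a full-model fit lm_full and a candidate submodel lm_sub (the names are ours):
sigma2_hat <- summary(lm_full)$sigma^2   # MS_res of the full model, the estimate of sigma^2
ss_res_p   <- sum(resid(lm_sub)^2)       # SS_res(p) of the candidate model
n <- nobs(lm_sub)
p <- length(coef(lm_sub))                # number of coefficients, intercept included
ss_res_p / sigma2_hat - n + 2 * p        # Mallows' Cp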
For a model with \(p\) coefficients (\(k = p - 1\) predictors), the Predicted Residual Error Sum of Squares (PRESS) is
\(\text{PRESS}_p = \sum_{i=1}^n[y_i - \hat{y}_{(i)}]^2 = \sum_{i=1}^n\left( \frac{e_i}{1-h_{ii}}\right)^2\), where \(e_i = y_i - \hat{y}_i\) and \(\hat{y}_{(i)}\) is the prediction of \(y_i\) from the model fitted without the \(i\)th observation.
\(R_{pred, p}^2 = 1 - \frac{PRESS_p}{SS_T}\)
\(\text{Absolute PRESS}_p = \sum_{i=1}^n|y_i - \hat{y}_{(i)}|\) can be used instead when a few large prediction errors would be too influential after squaring.
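Thanks to the hat-value identity above, PRESS requires no refitting; a sketch for any lm fit fit:
e <- resid(fit)
h <- hatvalues(fit)                      # leverages h_ii
press <- sum((e / (1 - h))^2)            # PRESS_p
y <- model.response(model.frame(fit))
1 - press / sum((y - mean(y))^2)         # predicted R-squared
sum(abs(e / (1 - h)))                    # absolute PRESS_p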
manpower <- read.csv(file = "./data/manpower.csv", header = TRUE)
lm_full <- lm(y ~ x1 + x2 + x3 + x4 + x5,
data = manpower)
summ_full <- summary(lm_full)
## Adjusted R sq.
summ_full$adj.r.squared
[1] 0.987
## PRESS of the full model
sum((resid(lm_full) / (1 - hatvalues(lm_full)))^2)
[1] 32195222
Use ols_mallows_cp() for Mallows' \(C_p\) in the olsrr package.
ols_step_all_possible()
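The column names and the all-subsets table below presumably come from a call along these lines (the exact return structure varies across olsrr versions; in recent releases the data frame sits in a $result element):
all_sub <- olsrr::ols_step_all_possible(lm_full)
names(all_sub)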
[1] "mindex" "n" "predictors" "rsquare" "adjr"
[6] "predrsq" "cp" "aic" "sbic" "sbc"
[11] "msep" "fpe" "apc" "hsp"
n: number of predictors
predictors: predictors in the model
rsquare: R-square of the model
adjr: adjusted R-square of the model
predrsq: predicted R-square of the model
cp: Mallows' Cp
aic: AIC
sbic: Sawa BIC
sbc: Schwarz BIC (the one we defined)
Index N Predictors R-Square Adj. R-Square Mallow's Cp
3 1 1 x3 0.972 0.970 20.38
1 2 1 x1 0.971 0.970 21.20
2 3 1 x2 0.893 0.886 114.97
4 4 1 x4 0.884 0.877 125.87
5 5 1 x5 0.335 0.290 785.26
10 6 2 x2 x3 0.987 0.985 4.94
6 7 2 x1 x2 0.986 0.984 5.66
14 8 2 x3 x5 0.985 0.983 7.29
9 9 2 x1 x5 0.984 0.982 8.16
13 10 2 x3 x4 0.975 0.972 18.57
8 11 2 x1 x4 0.974 0.970 20.04
7 12 2 x1 x3 0.973 0.969 21.99
11 13 2 x2 x4 0.931 0.921 72.29
12 14 2 x2 x5 0.924 0.913 80.30
15 15 2 x4 x5 0.910 0.898 96.50
23 16 3 x2 x3 x5 0.990 0.988 2.92
18 17 3 x1 x2 x5 0.989 0.987 3.71
16 18 3 x1 x2 x3 0.987 0.984 6.21
22 19 3 x2 x3 x4 0.987 0.984 6.90
17 20 3 x1 x2 x4 0.986 0.983 7.66
20 21 3 x1 x3 x5 0.985 0.982 8.97
25 22 3 x3 x4 x5 0.985 0.982 9.00
21 23 3 x1 x4 x5 0.985 0.981 9.41
19 24 3 x1 x3 x4 0.978 0.974 16.80
24 25 3 x2 x4 x5 0.952 0.941 48.28
30 26 4 x2 x3 x4 x5 0.991 0.988 4.03
28 27 4 x1 x2 x4 x5 0.991 0.987 4.26
27 28 4 x1 x2 x3 x5 0.991 0.987 4.35
26 29 4 x1 x2 x3 x4 0.988 0.984 7.54
29 30 4 x1 x3 x4 x5 0.985 0.980 10.92
31 31 5 x1 x2 x3 x4 x5 0.991 0.987 6.00
ols_step_best_subset()
# metric = c("rsquare", "adjr", "predrsq", "cp", "aic", "sbic", "sbc", "msep", "fpe", "apc", "hsp")
olsrr::ols_step_best_subset(lm_full, metric = "predrsq")
Best Subsets Regression
-----------------------------
Model Index Predictors
-----------------------------
1 x3
2 x2 x3
3 x2 x3 x5
4 x2 x3 x4 x5
5 x1 x2 x3 x4 x5
-----------------------------
Subsets Regression Summary
-------------------------------------------------------------------------------------------------------------------------------------------------
Adj. Pred
Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
-------------------------------------------------------------------------------------------------------------------------------------------------
1 0.9722 0.9703 0.956 20.3812 285.5158 234.8274 288.0155 15612703.3319 1025426.9349 65534.8041 0.0352
2 0.9867 0.9848 0.964 4.9416 274.9519 227.0975 278.2847 8029607.6257 552301.0536 36111.9920 0.0190
3 0.9901 0.9878 0.964 2.9177 272.0064 226.8104 276.1724 6503027.4301 466884.0206 31496.1442 0.0160
4 0.9908 0.9877 0.942 4.0263 272.6849 229.2716 277.6841 6563621.8068 490245.8263 34438.7564 0.0168
5 0.9908 0.9867 0.935 6.0000 274.6442 232.3507 280.4767 7202730.2305 557787.1896 41227.7488 0.0192
-------------------------------------------------------------------------------------------------------------------------------------------------
AIC: Akaike Information Criteria
SBIC: Sawa's Bayesian Information Criteria
SBC: Schwarz Bayesian Criteria
MSEP: Estimated error of prediction, assuming multivariate normality
FPE: Final Prediction Error
HSP: Hocking's Sp
APC: Amemiya Prediction Criteria
k | x1 | x2 | x3 | x4 | x5 | r2 | adj_r2 | cp | press | abs_press | r2_pred | aic | bic |
---|----|----|----|----|----|----|--------|----|-------|-----------|---------|-----|-----|
1 | 0 | 0 | 1 | 0 | 0 | 0.972 | 0.970 | 20.38 | 2.18e+07 | 13431 | 0.956 | 235 | 237 |
1 | 1 | 0 | 0 | 0 | 0 | 0.971 | 0.970 | 21.20 | 2.22e+07 | 13666 | 0.955 | 236 | 237 |
1 | 0 | 1 | 0 | 0 | 0 | 0.893 | 0.886 | 114.97 | 1.58e+08 | 28031 | 0.680 | 258 | 260 |
1 | 0 | 0 | 0 | 1 | 0 | 0.884 | 0.877 | 125.87 | 7.01e+07 | 22890 | 0.858 | 260 | 261 |
1 | 0 | 0 | 0 | 0 | 1 | 0.335 | 0.290 | 785.26 | 4.56e+08 | 61324 | 0.079 | 289 | 291 |
2 | 0 | 1 | 1 | 0 | 0 | 0.987 | 0.985 | 4.94 | 1.79e+07 | 12742 | 0.964 | 225 | 227 |
2 | 1 | 1 | 0 | 0 | 0 | 0.986 | 0.984 | 5.66 | 1.80e+07 | 12873 | 0.964 | 225 | 228 |
2 | 0 | 0 | 1 | 0 | 1 | 0.985 | 0.983 | 7.29 | 1.26e+07 | 10915 | 0.974 | 227 | 230 |
2 | 1 | 0 | 0 | 0 | 1 | 0.984 | 0.982 | 8.16 | 1.30e+07 | 11074 | 0.974 | 228 | 230 |
2 | 0 | 0 | 1 | 1 | 0 | 0.975 | 0.972 | 18.57 | 3.25e+07 | 16825 | 0.934 | 235 | 238 |
2 | 1 | 0 | 0 | 1 | 0 | 0.974 | 0.970 | 20.04 | 3.58e+07 | 17328 | 0.928 | 236 | 239 |
2 | 1 | 0 | 1 | 0 | 0 | 0.973 | 0.969 | 21.99 | 2.26e+07 | 13957 | 0.954 | 237 | 240 |
2 | 0 | 1 | 0 | 1 | 0 | 0.931 | 0.921 | 72.29 | 1.41e+08 | 28166 | 0.716 | 253 | 255 |
2 | 0 | 1 | 0 | 0 | 1 | 0.924 | 0.913 | 80.30 | 1.29e+08 | 25960 | 0.740 | 254 | 257 |
2 | 0 | 0 | 0 | 1 | 1 | 0.910 | 0.898 | 96.50 | 7.12e+07 | 25251 | 0.856 | 257 | 260 |
3 | 0 | 1 | 1 | 0 | 1 | 0.990 | 0.988 | 2.92 | 1.78e+07 | 12019 | 0.964 | 222 | 225 |
3 | 1 | 1 | 0 | 0 | 1 | 0.989 | 0.987 | 3.71 | 1.81e+07 | 12163 | 0.964 | 223 | 226 |
3 | 1 | 1 | 1 | 0 | 0 | 0.987 | 0.984 | 6.21 | 2.28e+07 | 14243 | 0.954 | 226 | 229 |
3 | 0 | 1 | 1 | 1 | 0 | 0.987 | 0.984 | 6.90 | 3.01e+07 | 15780 | 0.939 | 227 | 230 |
3 | 1 | 1 | 0 | 1 | 0 | 0.986 | 0.983 | 7.66 | 3.28e+07 | 16227 | 0.934 | 227 | 231 |
3 | 1 | 0 | 1 | 0 | 1 | 0.985 | 0.982 | 8.97 | 1.30e+07 | 11104 | 0.974 | 229 | 232 |
3 | 0 | 0 | 1 | 1 | 1 | 0.985 | 0.982 | 9.00 | 1.63e+07 | 11801 | 0.967 | 229 | 232 |
3 | 1 | 0 | 0 | 1 | 1 | 0.985 | 0.981 | 9.41 | 1.78e+07 | 12296 | 0.964 | 229 | 232 |
3 | 1 | 0 | 1 | 1 | 0 | 0.978 | 0.974 | 16.80 | 3.44e+07 | 18046 | 0.930 | 235 | 238 |
3 | 0 | 1 | 0 | 1 | 1 | 0.952 | 0.941 | 48.28 | 1.07e+08 | 26242 | 0.784 | 248 | 252 |
4 | 0 | 1 | 1 | 1 | 1 | 0.991 | 0.988 | 4.03 | 2.86e+07 | 14606 | 0.942 | 222 | 227 |
4 | 1 | 1 | 0 | 1 | 1 | 0.991 | 0.987 | 4.26 | 3.03e+07 | 14992 | 0.939 | 223 | 227 |
4 | 1 | 1 | 1 | 0 | 1 | 0.991 | 0.987 | 4.35 | 2.25e+07 | 13364 | 0.955 | 223 | 227 |
4 | 1 | 1 | 1 | 1 | 0 | 0.988 | 0.984 | 7.54 | 3.77e+07 | 17952 | 0.924 | 227 | 231 |
4 | 1 | 0 | 1 | 1 | 1 | 0.985 | 0.980 | 10.92 | 1.86e+07 | 12773 | 0.962 | 231 | 235 |
5 | 1 | 1 | 1 | 1 | 1 | 0.991 | 0.987 | 6.00 | 3.22e+07 | 16025 | 0.935 | 224 | 229 |
To compare the criteria across candidate models on a common axis, scale \(C_p\), AIC, and BIC to \([0, 1]\).
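A minimal sketch of that rescaling, reusing the all_sub object assumed above (column names taken from the glossary):
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))   # map a criterion to [0, 1]
cbind(cp = rescale01(all_sub$cp), aic = rescale01(all_sub$aic), sbc = rescale01(all_sub$sbc))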
ols_step_forward_p()
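The summary below presumably comes from running forward selection on the full model fitted earlier:
olsrr::ols_step_forward_p(lm_full)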
Selection Summary
---------------------------------------------------------------------------
Variable Adj.
Step Entered R-Square R-Square C(p) AIC RMSE
---------------------------------------------------------------------------
1 x3 0.9722 0.9703 20.3812 285.5158 957.8556
2 x2 0.9867 0.9848 4.9416 274.9519 685.1685
3 x5 0.9901 0.9878 2.9177 272.0064 614.7794
---------------------------------------------------------------------------
ols_step_forward_p()
Call:
lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
data = l)
Coefficients:
(Intercept) x3 x2 x5
1523.389 0.978 0.053 -320.951
Set progress = TRUE or details = TRUE for detailed selection progress.
ols_step_both_p()
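As above, the stepwise summary that follows can be reproduced with a call like:
olsrr::ols_step_both_p(lm_full)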
Stepwise Selection Summary
---------------------------------------------------------------------------------------
Added/ Adj.
Step Variable Removed R-Square R-Square C(p) AIC RMSE
---------------------------------------------------------------------------------------
1 x3 addition 0.972 0.970 20.3810 285.5158 957.8556
2 x2 addition 0.987 0.985 4.9420 274.9519 685.1685
3 x5 addition 0.990 0.988 2.9180 272.0064 614.7794
---------------------------------------------------------------------------------------
Comments on Stepwise-Type Procedures
olsrr::ols_step_backward_p()
olsrr::ols_step_forward_aic()
olsrr::ols_step_both_aic()
olsrr::ols_step_backward_aic()
The order in which the regressors enter or leave the model does not necessarily imply an order of importance for the regressors.
No one model may be the “best”.
Different stepwise techniques could result in different models.
math4780-f23.github.io/website