MATH 4780 / MSSC 5780 Regression Analysis
Fit three separate independent SLR models:
Fitting a separate SLR model for each predictor is not satisfactory.
I hope you don't feel…
What I hope is…
How many parameters are there in the model?
\[\begin{align} y_i &= \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_k x_{ik} + \epsilon_i \\ &= \beta_0 + \sum_{j=1}^k\beta_j x_{ij} + \epsilon_i, \quad i = 1, 2, \dots, n \end{align}\]
\[\begin{align} \left.\frac{\partial S}{\partial\beta_0}\right\vert_{b_0, b_1, \dots, b_k} &= -2 \sum_{i=1}^n\left(y_i - b_0 - \sum_{j=1}^k b_j x_{ij}\right) = 0\\ \left.\frac{\partial S}{\partial\beta_j}\right\vert_{b_0, b_1, \dots, b_k} &= -2 \sum_{i=1}^n\left(y_i - b_0 - \sum_{j=1}^k b_j x_{ij}\right)x_{ij} = 0, \quad j = 1, 2, \dots, k \end{align}\]
I'll buy you a drink if you solve the equations for \(k \ge 2\) by hand without using matrix notation or operations!
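In contrast, the matrix solution takes one line. Here is a minimal sketch in R, with simulated data and illustrative variable names, solving the normal equations \(\mathbf{X}^\top\mathbf{X}\,\mathbf{b} = \mathbf{X}^\top\mathbf{y}\):

## least squares via the normal equations in matrix form (sketch, simulated data)
set.seed(1)
n <- 20
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
X <- cbind(1, x1, x2)          ## design matrix with an intercept column
solve(t(X) %*% X, t(X) %*% y)  ## b = (X'X)^{-1} X'y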
# Load the data set
delivery <- read.csv(file = "./data/data-ex-3-1.csv",
header = TRUE)
delivery_data <- delivery[, -1]
colnames(delivery_data) <- c("time", "cases", "distance")
str(delivery_data)
'data.frame': 25 obs. of 3 variables:
$ time : num 16.7 11.5 12 14.9 13.8 ...
$ cases : int 7 3 3 4 6 7 2 7 30 5 ...
$ distance: int 560 220 340 80 150 330 110 210 1460 605 ...
delivery_data
time cases distance
1 16.7 7 560
2 11.5 3 220
3 12.0 3 340
4 14.9 4 80
5 13.8 6 150
6 18.1 7 330
7 8.0 2 110
8 17.8 7 210
9 79.2 30 1460
10 21.5 5 605
11 40.3 16 688
12 21.0 10 215
13 13.5 4 255
14 19.8 6 462
15 24.0 9 448
16 29.0 10 776
17 15.3 6 200
18 19.0 7 132
19 9.5 3 36
20 35.1 17 770
21 17.9 10 140
22 52.3 26 810
23 18.8 9 450
24 19.8 8 635
25 10.8 4 150
pairs(delivery_data)
par(mgp = c(2, 0.8, 0), las = 1, mar = c(4, 4, 0, 0))
library(scatterplot3d)
scatterplot3d(x = delivery_data$cases, y = delivery_data$distance, z = delivery_data$time,
xlab ="cases", ylab = "distance", zlab = "time",
xlim = c(2, 30), ylim = c(36, 1640), zlim = c(8, 80),
box = TRUE, color = "blue", mar = c(3, 3, 0, 2), angle = 30, pch = 16)
delivery_lm <- lm(time ~ cases + distance, data = delivery_data)
delivery_lm$coef
(Intercept) cases distance
2.341 1.616 0.014
\[\hat{y} = 2.34 + 1.62x_1 + 0.014x_2\]
\(b_1\): All else held constant, for each additional case of product stocked, we expect the delivery time to be longer, on average, by 1.62 minutes.
\(b_2\): All else held constant, each additional foot walked by the driver is associated with a delivery time that is, on average, 0.014 minutes longer.
\(b_0\): The expected delivery time when no cases of product are stocked and the driver walks no distance is 2.34 minutes. (Does that make sense?!)
\(SS_{res} = \sum_{i=1}^ne_i^2 = \sum_{i=1}^n(y_i - \hat{y}_i)^2\).
\(MS_{res} = \frac{SS_{res}}{n - p}\) with \(p = k + 1\).
\(S^2 = MS_{res}\) is unbiased for \(\sigma^2\), i.e., \(E[MS_{res}] = \sigma^2\).
The \(S^2\) of an SLR may be considerably larger than the \(S^2\) of the corresponding MLR.
\(S^2\) measures the variation of the unexplained noise about the fitted regression line/hyperplane, so we prefer a small residual mean square.
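As a quick illustration with the delivery data (a sketch; slr_cases is an illustrative name), compare the residual mean square of the SLR with cases only against that of the MLR:

## S^2 for the SLR with cases only vs. the MLR with both predictors
slr_cases <- lm(time ~ cases, data = delivery_data)
c(slr = summary(slr_cases)$sigma ^ 2, mlr = summary(delivery_lm)$sigma ^ 2)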
## method 1
summ_delivery <- summary(delivery_lm) ## check names(summ_delivery)
summ_delivery$sigma ^ 2
[1] 11
## method 3
(SS_res1 <- sum((delivery_data$time - delivery_lm$fitted.values) ^ 2))
[1] 234
n <- nrow(delivery_data)  ## sample size
SS_res1 / (n - 3)
[1] 11
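As one more cross-check, the residual mean square can be read directly off the fitted model's ANOVA table:

anova(delivery_lm)["Residuals", "Mean Sq"]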
The \((1-\alpha)100\%\) Wald CI for \(\beta_j\), \(j = 0, 1, \dots, k\) is \[\left(b_j- t_{\alpha/2, n-p}~se(b_j), \quad b_j + t_{\alpha/2, n-p}~ se(b_j)\right)\]
(ci <- confint(delivery_lm))
2.5 % 97.5 %
(Intercept) 0.0668 4.616
cases 1.2618 1.970
distance 0.0069 0.022
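To tie the output back to the formula, here is a hand computation of the same Wald intervals (a sketch; it should reproduce confint() above):

## Wald CIs from the formula b_j -/+ t_{alpha/2, n-p} * se(b_j)
b <- coef(delivery_lm)
se <- sqrt(diag(vcov(delivery_lm)))
t_crit <- qt(0.975, df = delivery_lm$df.residual)
cbind(lwr = b - t_crit * se, upr = b + t_crit * se)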
## correlation matrix
V <- vcov(delivery_lm)  ## variance-covariance matrix of the coefficient estimates
cov2cor(V)
(Intercept) cases distance
(Intercept) 1.00 -0.25 -0.22
cases -0.25 1.00 -0.82
distance -0.22 -0.82 1.00
\(b_1\) and \(b_2\) are negatively correlated.
Individual CIs ignore the correlation between the \(b_j\)s.
How do we specify a confidence level that applies simultaneously to a set of interval estimates? For example, a \(95\%\) confidence "interval" for both \(b_1\) and \(b_2\).
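One common conservative option (a sketch, separate from the joint confidence region plotted below) is a Bonferroni adjustment: run each of the \(m\) intervals at level \(1 - \alpha/m\), so that all \(m\) hold jointly with confidence at least \(1 - \alpha\):

## Bonferroni: each interval at level 1 - 0.05/2 so both hold jointly at >= 95%
confint(delivery_lm, parm = c("cases", "distance"), level = 1 - 0.05 / 2)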
par(mgp = c(2.8, 0.9, 0), mar = c(4, 4, 2, 0))
## confidence region
car::confidenceEllipse(
delivery_lm,
levels = 0.95,
which.coef = c("cases", "distance"),
main = expression(
paste("95% Confidence Region for ",
beta[1], " and ", beta[2])
)
)
## marginal CI for cases
abline(v = ci[2, ], lty = 2, lwd = 2)
## marginal CI for distance
abline(h = ci[3, ], lty = 2, lwd = 2)
points(x = 1.4, y = 0.01, col = "red", cex = 2, pch = 16)  ## inside both marginal CIs
points(x = 2, y = 0.008, col = "black", cex = 2, pch = 16) ## outside the marginal CI for cases
predict(delivery_lm,
newdata = data.frame(cases = 8, distance = 275),
interval = "confidence", level = 0.95)
fit lwr upr
1 19 18 21
predict(delivery_lm,
newdata = data.frame(cases = 8, distance = 275),
interval = "predict", level = 0.95)
fit lwr upr
1 19 12 26
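To see where these numbers come from, here is a hand computation at \(\mathbf{x}_0 = (1, 8, 275)^\top\) (a sketch; it should match the predict() output above):

## CI and PI at x0 using x0'(X'X)^{-1}x0
x0 <- c(1, 8, 275)
X <- model.matrix(delivery_lm)
h0 <- as.numeric(t(x0) %*% solve(t(X) %*% X) %*% x0)
fit <- as.numeric(crossprod(x0, coef(delivery_lm)))
ms_res <- summary(delivery_lm)$sigma ^ 2
t_crit <- qt(0.975, df = delivery_lm$df.residual)
c(fit = fit,
  ci_lwr = fit - t_crit * sqrt(ms_res * h0),
  ci_upr = fit + t_crit * sqrt(ms_res * h0),
  pi_lwr = fit - t_crit * sqrt(ms_res * (1 + h0)),
  pi_upr = fit + t_crit * sqrt(ms_res * (1 + h0)))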
A complete picture of the regression surface requires drawing a \(p\)-dimensional graph.
Predictor effect plots instead show one- or two-dimensional plots for each predictor.
Same slope for any choice of fixed values of other predictors.
The intercepts depend on the values of other predictors.
library(effects)
plot(effects::predictorEffects(mod = delivery_lm))
| Source of Variation | SS | df | MS | F | \(p\)-value |
|---|---|---|---|---|---|
| Regression | \(SS_R\) | \(k\) | \(MS_R\) | \(\frac{MS_R}{MS_{res}} = F_{test}\) | \(P(F_{k, n-k-1} > F_{test})\) |
| Residual | \(SS_{res}\) | \(n-k-1\) | \(MS_{res}\) | | |
| Total | \(SS_T\) | \(n-1\) | | | |
\(H_0: \beta_{1} = \beta_{2} = 0 \quad H_1: \beta_j \ne 0 \text{ for at least one } j\)
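Here is a sketch computing the overall \(F\)-statistic directly from the sums of squares defined in the table; it should match the F-statistic line of the summary below:

## overall F-test from sums of squares
n <- nrow(delivery_data); k <- 2
ss_t <- sum((delivery_data$time - mean(delivery_data$time)) ^ 2)
ss_res <- sum(delivery_lm$residuals ^ 2)
f_test <- ((ss_t - ss_res) / k) / (ss_res / (n - k - 1))
c(F = f_test, p_value = pf(f_test, k, n - k - 1, lower.tail = FALSE))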
summ_delivery
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.34123 1.09673 2.13 0.04417 *
cases 1.61591 0.17073 9.46 3.3e-09 ***
distance 0.01438 0.00361 3.98 0.00063 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.3 on 22 degrees of freedom
Multiple R-squared: 0.96, Adjusted R-squared: 0.956
F-statistic: 261 on 2 and 22 DF, p-value: 4.69e-16
...
anova(delivery_lm) ## This is for sequential F-test
Analysis of Variance Table
Response: time
Df Sum Sq Mean Sq F value Pr(>F)
cases 1 5382 5382 506.6 < 2e-16 ***
distance 1 168 168 15.8 0.00063 ***
Residuals 22 234 11
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Testing coefficients is like model comparison: Compare the full model with the model under \(H_0\)!
Here the full model is the one with both the cases and distance predictors; see the sketch below.
Occam's Razor: Don't use a complex model if a simpler model can perform equally well!
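For example, a sketch of testing \(H_0: \beta_2 = 0\) by explicit model comparison (reduced is an illustrative name); the resulting \(F\)-test reproduces the \(t\)-test for distance above:

## partial F-test: reduced model (cases only) vs. full model
reduced <- lm(time ~ cases, data = delivery_data)
anova(reduced, delivery_lm)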
Intuition: A newly added regressor should have enough explanatory power for \(y\) that \(MS_{res}\) decreases.
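A sketch computing \(R^2_{adj} = 1 - \frac{MS_{res}}{MS_T}\) from this intuition; it should match summ_delivery$adj.r.squared below:

## adjusted R^2 = 1 - MS_res / MS_T
ms_res <- summary(delivery_lm)$sigma ^ 2
ms_t <- var(delivery_data$time)  ## SS_T / (n - 1)
1 - ms_res / ms_t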
summ_delivery
...
Residual standard error: 3.3 on 22 degrees of freedom
Multiple R-squared: 0.96, Adjusted R-squared: 0.956
...
summ_delivery$r.squared
[1] 0.96
summ_delivery$adj.r.squared
[1] 0.96
summ_delivery$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.341 1.0967 2.1 4.4e-02
cases 1.616 0.1707 9.5 3.3e-09
distance 0.014 0.0036 4.0 6.3e-04