MATH 4780 / MSSC 5780 Regression Analysis
For the \(i\)-th measurement in the target population, \[Y_i = \beta_0 + \beta_1X_i + \epsilon_i\]
When we collect data \(\{ (x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\},\) each response \(y\) is assumed to be drawn from a normal distribution, and its value varies around its mean \(\mu_y\).
\(Y_i = \beta_0 + \beta_1X_i + \epsilon_i\) \(\quad \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\)
For a random variable \(Z\) and a constant \(c \in \mathbf{R}\), \(E(c+Z) = E(c) + E(Z) = c + E(Z)\).
\[\begin{align*} \mu_{Y_i \mid X_i} = E(Y_i \mid X_i) &= E(\beta_0 + \beta_1X_i + \epsilon_i) \\ &= \beta_0 + \beta_1X_i + E(\epsilon_i) \\ &= \beta_0 + \beta_1X_i \end{align*}\]
The mean response of \(Y\), \(\mu_{Y\mid X} = E(Y\mid X)\), has a straight-line relationship with \(X\) given by the population regression line \[\mu_{Y\mid X} = \beta_0 + \beta_1X\]
\(Y_i = \beta_0 + \beta_1X_i + \epsilon_i\) \(\quad \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\)
For a random variable \(Z\) and a constant \(c \in \mathbf{R}\), \(\mathrm{Var}(c+Z) = \mathrm{Var}(Z)\).
\[\begin{align*} \mathrm{Var}(Y_i \mid X_i) &= \mathrm{Var}(\beta_0 + \beta_1X_i + \epsilon_i) \\ &= \mathrm{Var}(\epsilon_i) = \sigma^2 \end{align*}\] The variance of \(Y\) does not depend on \(X\).
\(Y_i = \beta_0 + \beta_1X_i + \epsilon_i\); \(\quad \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\)
For a random variable \(Z \sim N(\mu, \sigma^2)\) and a constant \(c \in \mathbf{R}\), \(c+Z \sim N(c + \mu, \sigma^2)\).
\[\begin{align*} Y_i \mid X_i \stackrel{indep}{\sim} N(\beta_0 + \beta_1X_i, \sigma^2) \end{align*}\] For any fixed value of \(X_i = x\), the response \(Y_i\) varies according to \(N(\mu_{Y_i\mid X_i = x}, \sigma^2)\).
Job: Collect data and estimate the unknown \(\beta_0\), \(\beta_1\) and \(\sigma^2\)!
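A minimal simulation sketch of this model (the values \(\beta_0 = 2\), \(\beta_1 = 0.5\), \(\sigma = 1\) and the uniform design for \(x\) are assumptions chosen only for illustration):

## simulate one sample of size n = 50 from y = beta_0 + beta_1 * x + epsilon
set.seed(2024)
n <- 50
x <- runif(n, min = 0, max = 10)            ## predictor values
beta_0 <- 2; beta_1 <- 0.5; sigma <- 1      ## assumed true parameters
epsilon <- rnorm(n, mean = 0, sd = sigma)   ## iid N(0, sigma^2) errors
y <- beta_0 + beta_1 * x + epsilon          ## responses vary around the mean line
plot(x, y, pch = 19)
abline(a = beta_0, b = beta_1, col = "red") ## true population regression line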
\[\begin{align*} y_i &= f(x_i) + \epsilon_i \\ &= \beta_0 + \beta_1~x_{i} + \epsilon_i, \end{align*}\] or
\[E(y_i \mid x_i) = \mu_{y|x_i} = \beta_0 + \beta_1~x_{i}\]
Use sample statistics \(b_0\) and \(b_1\) computed from our sample data to estimate \(\beta_0\) and \(\beta_1\).
\(\hat{y}_{i} = b_0 + b_1~x_{i}\) is called the fitted value of \(y_i\); it is a point estimate of the mean \(\mu_{y|x_i}\) and of \(y_i\) itself.
Given the training sample \(\{ (x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\},\)
Choose the best \(b_0\) and \(b_1\), i.e., the sample regression line, by minimizing the sum of squared residuals \[SS_{res} = e_1^2 + e_2^2 + \dots + e_n^2 = \sum_{i = 1}^n e_i^2.\]
\[\begin{align} SS_{res} &= (y_1 - b_0 - b_1x_1)^2 + (y_2 - b_0 - b_1x_2)^2 + \dots + (y_n - b_0 - b_1x_n)^2\\ &= \sum_{i=1}^n(y_i - b_0 - b_1x_i)^2 \end{align}\] which is smaller than any other \(SS_{res} = \sum_{i=1}^n(y_i - a_0 - a_1x_i)^2\) computed with a different pair of estimates \((a_0, a_1) \ne (b_0, b_1)\).
Take derivatives of \(SS_{res}\) with respect to \(\alpha_0\) and \(\alpha_1\) and set both equal to zero: \[\left.\frac{\partial SS_{res}}{\partial\alpha_0}\right\vert_{b_0, b_1} = \left.\sum_{i=1}^n\frac{\partial (y_i - \alpha_0 - \alpha_1x_i)^2}{\partial\alpha_0}\right\vert_{b_0, b_1} = -2 \sum_{i=1}^n(y_i - b_0 - b_1x_i) = 0\] \[\left. \frac{\partial SS_{res}}{\partial\alpha_1}\right\vert_{b_0, b_1} = \left.\sum_{i=1}^n\frac{\partial (y_i - \alpha_0 - \alpha_1x_i)^2}{\partial\alpha_1}\right\vert_{b_0, b_1} = -2 \sum_{i=1}^nx_i(y_i - b_0 - b_1x_i) = 0\] The two equations are called the normal equations.
\[ \color{red}{b_0 = \overline{y} - b_1\overline{x}}\]
\[\color{red}{b_1 = \frac{\sum_{i=1}^n(x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^n(x_i - \overline{x})^2} = \frac{S_{xy}}{S_{xx}} = r \frac{\sqrt{S_{yy}}}{\sqrt{S_{xx}}}},\] where \(S_{xx} = \sum_{i=1}^n(x_i - \overline{x})^2\), \(S_{yy} = \sum_{i=1}^n(y_i - \overline{y})^2\), \(S_{xy} = \sum_{i=1}^n(x_i - \overline{x})(y_i - \overline{y})\), and \(r\) is the sample correlation coefficient between \(x\) and \(y\).
What can we learn from the formula of \(b_0\) and \(b_1\)?
Highway MPG (hwy) vs. displacement (displ) from the mpg data in the ggplot2 package.

# A tibble: 234 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
# ℹ 228 more rows
plot(x = mpg$displ, y = mpg$hwy, las = 1, pch = 19, col = "navy", cex = 0.5,
xlab = "Displacement (litres)", ylab = "Highway MPG",
main = "Highway MPG vs. Engine Displacement (litres)")
(reg_fit <- lm(formula = hwy ~ displ, data = mpg))
Call:
lm(formula = hwy ~ displ, data = mpg)
Coefficients:
(Intercept) displ
35.70 -3.53
typeof(reg_fit)
[1] "list"
## use $ to extract an element of a list
reg_fit$coefficients
(Intercept) displ
35.70 -3.53
\(\widehat{hwy}_{i} = b_0 + b_1 \times displ_{i} = 35.698 -3.531 \times displ_{i}\)
\(b_1\): For a one-unit (litre) increase in displacement, we expect the highway MPG to decrease, on average, by about 3.5 miles per gallon.
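As a check (assuming the mpg data from ggplot2 is loaded as above), the closed-form formulas reproduce the coefficients returned by lm():

## compute b_1 = S_xy / S_xx and b_0 = ybar - b_1 * xbar by hand
x <- mpg$displ; y <- mpg$hwy
S_xx <- sum((x - mean(x)) ^ 2)
S_xy <- sum((x - mean(x)) * (y - mean(y)))
b_1 <- S_xy / S_xx               ## about -3.531
b_0 <- mean(y) - b_1 * mean(x)   ## about 35.698
c(b_0 = b_0, b_1 = b_1)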
## all elements in reg_fit
names(reg_fit)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
mpg$hwy[1:5]
[1] 29 29 31 30 26
## the first 5 fitted values y_hat
head(reg_fit$fitted.values, 5)
1 2 3 4 5
29.3 29.3 28.6 28.6 25.8
mpg$displ[1:5]
[1] 1.8 1.8 2.0 2.0 2.8
length(reg_fit$fitted.values)
[1] 234
Both \(b_0\) and \(b_1\) are linear combinations of \(y_1, \dots, y_n\).
Therefore, \(\hat{y}_i\) is a linear combination of \(y_1, \dots, y_n\)!
\[\begin{align} \hat{y}_i &= b_0 + b_1 ~x_i \\&= \overline{y} - b_1~\overline{x} + x_i\sum_{j=1}^n c_j y_j \\&= \cdots \\&= \sum_{j=1}^nh_{ij}y_j \\&= h_{i1}y_1 + h_{i2}y_2 + \cdots + h_{in}y_n\end{align}\] where \(h_{ij} = \frac{1}{n} + \frac{(x_i-\overline{x})(x_j-\overline{x})}{S_{xx}}.\)
If \(h_{ij}\) is large, the \(j\)th case can have substantial impact on the \(i\)th fitted value.
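In R, the leverages \(h_{ii}\) are available via hatvalues(); a quick check against the formula above (using the reg_fit object from before):

## leverage h_ii = 1/n + (x_i - xbar)^2 / S_xx
x <- mpg$displ
h_manual <- 1 / length(x) + (x - mean(x)) ^ 2 / sum((x - mean(x)) ^ 2)
head(hatvalues(reg_fit), 3)   ## built-in leverages
head(h_manual, 3)             ## should agree with the formula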
Responses \(\{y_i\}_{i=1}^n\) are random variables before we actually collect them. So are \(b_0\) and \(b_1\).
\(b_0\) and \(b_1\) are unbiased estimators for \(\beta_0\) and \(\beta_1\), respectively, i.e., \(E(b_0) = \beta_0\), \(E(b_1) = \beta_1\).
For random variables \(Z_1, \dots, Z_n\) and constants \(c_1, \dots, c_n \in \mathbf{R}\), \(E(c_1Z_1 + \dots +c_nZ_n) = c_1E(Z_1) + \dots + c_nE(Z_n)\)
\[\begin{align*} E(b_1) &= E\left( \sum_{i=1}^n c_iy_i\right) = \sum_{i=1}^n c_i E(y_i) = \sum_{i=1}^n c_i (\beta_0+\beta_1x_i) \\ &= \beta_0 \sum_{i=1}^n c_i+ \beta_1\sum_{i=1}^n c_ix_i = \beta_1, \end{align*}\] since \(c_i = (x_i - \overline{x})/S_{xx}\), so that \(\sum_{i=1}^n c_i = 0\) and \(\sum_{i=1}^n c_i x_i = 1\).
For independent random variables \(Z_1, \dots, Z_n\) and constants \(c_1, \dots, c_n \in \mathbf{R}\), \(\mathrm{Var}(c_1Z_1 + \dots +c_nZ_n) = c_1^2\mathrm{Var}(Z_1) + \dots + c_n^2\mathrm{Var}(Z_n)\)
\[\begin{align*} \mathrm{Var}(b_1) &= \mathrm{Var}\left( \sum_{i=1}^n c_iy_i\right) = \sum_{i=1}^n c_i^2 \mathrm{Var}(y_i) = \sigma^2\sum_{i=1}^n c_i^2 = \frac{\sigma^2\sum_{i=1}^n (x_i - \overline{x})^2}{S_{xx}^2} = \frac{\sigma^2}{S_{xx}} \end{align*}\]
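A small Monte Carlo sketch (assumed values \(\beta_0 = 2\), \(\beta_1 = 0.5\), \(\sigma = 1\)) confirming that \(E(b_1) \approx \beta_1\) and \(\mathrm{Var}(b_1) \approx \sigma^2/S_{xx}\):

## sampling distribution of b_1 over repeated samples (x held fixed)
set.seed(5780)
n <- 30; x <- runif(n, 0, 10)
beta_0 <- 2; beta_1 <- 0.5; sigma <- 1   ## assumed true parameters
S_xx <- sum((x - mean(x)) ^ 2)
b1_rep <- replicate(5000, {
  y <- beta_0 + beta_1 * x + rnorm(n, sd = sigma)
  coef(lm(y ~ x))[2]
})
mean(b1_rep)      ## close to beta_1 = 0.5
var(b1_rep)       ## close to sigma^2 / S_xx
sigma ^ 2 / S_xx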
👉 Sum of residuals is zero.
\[\scriptstyle \sum_{i=1}^n(y_i - \hat{y}_i) = \sum_{i=1}^ne_i = 0\]
👉 The sum of observations \(y_i\) equals the sum of the fitted values \(\hat{y}_i\).
\[\scriptstyle \sum_{i=1}^ny_i = \sum_{i=1}^n\hat{y}_i\]
👉 The LS regression line passes through the centroid of data \((\overline{x}, \overline{y})\).
👉 Inner product of residual and predictor is zero. \[\scriptstyle \sum_{i=1}^nx_ie_i = 0\]
👉 Inner product of residual and fitted value is zero. \[\scriptstyle \sum_{i=1}^n\hat{y}_ie_i = 0\]
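These properties can be verified numerically with the fitted mpg model (all sums are zero, or agree, up to floating-point error):

## numerical check of the LS properties for reg_fit
e <- reg_fit$residuals
y_hat <- reg_fit$fitted.values
sum(e)                      ## residuals sum to zero
sum(mpg$hwy) - sum(y_hat)   ## observed and fitted sums agree
sum(mpg$displ * e)          ## residuals orthogonal to the predictor
sum(y_hat * e)              ## residuals orthogonal to the fitted values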
(summ_reg_fit <- summary(reg_fit))
Call:
lm(formula = hwy ~ displ, data = mpg)
Residuals:
Min 1Q Median 3Q Max
-7.104 -2.165 -0.224 2.059 15.010
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.698 0.720 49.5 <2e-16 ***
displ -3.531 0.195 -18.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.84 on 232 degrees of freedom
Multiple R-squared: 0.587, Adjusted R-squared: 0.585
F-statistic: 329 on 1 and 232 DF, p-value: <2e-16
# lots of fitted information saved in summary(reg_fit)!
names(summ_reg_fit)
[1] "call" "terms" "residuals" "coefficients"
[5] "aliased" "sigma" "df" "r.squared"
[9] "adj.r.squared" "fstatistic" "cov.unscaled"
# residual standard error (sigma_hat)
summ_reg_fit$sigma
[1] 3.84
The inference methods require the model assumptions to be satisfied!
If \(y_i\) ( given \(x_i\) ) is normally distributed, do \(b_0\) and \(b_1\) follow normal distribution too?
If \(Z_1, \dots, Z_n\) are normal variables and constants \(c_1, \dots, c_n \in \mathbf{R}\), then \(c_1Z_1 + \dots + c_nZ_n\) is also a normal variable.
## Confidence interval for beta_0 and beta_1
confint(reg_fit, level = 0.95)
2.5 % 97.5 %
(Intercept) 34.28 37.12
displ -3.91 -3.15
## Confidence interval for sigma^2 (no built-in function)
alpha <- 0.05
SS_res <- sum(reg_fit$residuals ^ 2)
low_bd <- SS_res / qchisq(alpha / 2, df = reg_fit$df.residual, lower.tail = FALSE)
upp_bd <- SS_res / qchisq(alpha / 2, df = reg_fit$df.residual, lower.tail = TRUE)
low_bd
[1] 12.4
upp_bd
[1] 17.8
# MS_res (sigma_hat^2)
summ_reg_fit$sigma ^ 2
[1] 14.7
summ_reg_fit
...
lm(formula = hwy ~ displ, data = mpg)
Residuals:
Min 1Q Median 3Q Max
-7.104 -2.165 -0.224 2.059 15.010
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.698 0.720 49.5 <2e-16 ***
displ -3.531 0.195 -18.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
...
summ_reg_fit$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.70 0.720 49.6 2.12e-125
displ -3.53 0.195 -18.2 2.04e-46
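The t value and Pr(>|t|) columns can be reproduced from the estimate and standard error (a sketch for the displ row):

## t test of H_0: beta_1 = 0 by hand
est <- summ_reg_fit$coefficients[2, 1]   ## b_1
se  <- summ_reg_fit$coefficients[2, 2]   ## se(b_1)
t_stat <- est / se
p_val  <- 2 * pt(abs(t_stat), df = reg_fit$df.residual, lower.tail = FALSE)
c(t = t_stat, p = p_val)                 ## matches the displ row above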
If we reject \(H_0: \beta_1 = 0\), does it mean \(X\) and \(Y\) are linearly related?
Suppose we only have data \(Y\) and have no information about \(X\) or no information about the relationship between \(X\) and \(Y\). How do we predict a value of \(Y\)?
Our best guess would be \(\overline{y}\) if the data have no pattern, i.e., \(\hat{y}_i = \overline{y}\).
This amounts to treating \(X\) and \(Y\) as uncorrelated.
The (total) deviation from the mean is \((y_i - \overline{y})\).
If \(X\) and \(Y\) are linearly related, fitting a linear regression helps us predict the value of \(Y\) when the value of \(X\) is provided.
\(\hat{y}_i = b_0 + b_1x_i\) is closer to \(y_i\) than \(\overline{y}\).
The regression model explains some deviation of \(y\).
Total deviation = Deviation explained by regression + Unexplained deviation
\((y_i - \overline{y}) = (\hat{y}_i - \overline{y}) + (y_i - \hat{y}_i)\)
\((19 - 9) = (13 - 9) + (19 - 13)\)
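Squaring and summing this identity over all cases gives \(SS_T = SS_R + SS_{res}\) (the cross term vanishes by the LS properties); a quick numerical check with the mpg fit:

## SS_T = SS_R + SS_res for reg_fit
SS_T   <- sum((mpg$hwy - mean(mpg$hwy)) ^ 2)                ## total
SS_R   <- sum((reg_fit$fitted.values - mean(mpg$hwy)) ^ 2)  ## regression
SS_res <- sum(reg_fit$residuals ^ 2)                        ## residual
c(SS_T, SS_R + SS_res)                                      ## should agree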
What is the \(H_0\) of ANOVA \(F\)-test if there are \(k \ge 2\) predictors?
anova(reg_fit)
Analysis of Variance Table
Response: hwy
Df Sum Sq Mean Sq F value Pr(>F)
displ 1 4848 4848 329 <2e-16 ***
Residuals 232 3414 15
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summ_reg_fit$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.70 0.720 49.6 2.12e-125
displ -3.53 0.195 -18.2 2.04e-46
summ_reg_fit$coefficients[2, 3] ^ 2
[1] 329
The coefficient of determination \((R^2)\) is the proportion of the variation in \(y\) that is explained by the regression model: \[R^2 = \frac{SS_R}{SS_T} =\frac{SS_T - SS_{res}}{SS_T} = 1 - \frac{SS_{res}}{SS_T}\]
We can interpret \(R^2\) as the proportionate reduction of total variation associated with the use of \(X\).
(a) \(\hat{y}_i = y_i\) and \(\small SS_{res} = \sum_{i=1}^n(y_i - \hat{y}_i)^2 = 0\). (b) \(\hat{y}_i = \overline{y}\) and \(\small SS_R = \sum_{i=1}^n(\hat{y}_i - \overline{y})^2 = 0\).
summ_reg_fit
...
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.698 0.720 49.5 <2e-16 ***
displ -3.531 0.195 -18.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.84 on 232 degrees of freedom
Multiple R-squared: 0.587, Adjusted R-squared: 0.585
F-statistic: 329 on 1 and 232 DF, p-value: <2e-16
...
summ_reg_fit$r.squared
[1] 0.587
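A quick check that the reported value agrees with \(1 - SS_{res}/SS_T\) computed from the fit:

## R^2 = 1 - SS_res / SS_T
SS_res <- sum(reg_fit$residuals ^ 2)
SS_T   <- sum((mpg$hwy - mean(mpg$hwy)) ^ 2)
1 - SS_res / SS_T     ## about 0.587, matching summ_reg_fit$r.squared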
With predictor value \(x = x_0\), we want to estimate the mean response \[E(y\mid x_0) = \mu_{y|x_0} = \beta_0 + \beta_1 x_0.\]
If \(x_0\) is within the range of \(x\), an unbiased point estimate of \(E(y\mid x_0)\) is \[\widehat{E(y\mid x_0)} = \hat{\mu}_{y | x_0} = b_0 + b_1 x_0\]
The sampling distribution of \(\hat{\mu}_{y | x_0}\) is \[N\left( \mu_{y | x_0} = \beta_0 + \beta_1 x_0, \sigma^2\left(\frac{1}{n} + \frac{(x_0 - \overline{x})^2}{S_{xx}} \right) \right)\]
\((\hat{\mu}_{y | x_0} = b_0 + b_1 x_0) \sim N\left( \mu_{y | x_0} = \beta_0 + \beta_1 x_0, \sigma^2\left(\frac{1}{n} + \frac{(x_0 - \overline{x})^2}{S_{xx}} \right) \right)\)
\[\frac{(b_0 + b_1 x_0) - (\beta_0 + \beta_1 x_0)}{\sigma\sqrt{\frac{1}{n} + \frac{(x_0 - \overline{x})^2}{S_{xx}}}} \sim N(0, 1)\]
\[\frac{(b_0 + b_1 x_0) - (\beta_0 + \beta_1 x_0)}{s\sqrt{\frac{1}{n} + \frac{(x_0 - \overline{x})^2}{S_{xx}}}} \sim t_{n-2}\]
The \((1-\alpha)100\%\) CI for \(E(y\mid x_0)\) is \(\boxed{\hat{\mu}_{y | x_0} \pm t_{\alpha/2, n-2} s\sqrt{\frac{1}{n} + \frac{(x_0 - \overline{x})^2}{S_{xx}}}}\).
Does the length of the CI for \(E(y\mid x_0)\) stay the same at any location of \(x_0\)?
What is the sampling distribution of \(\hat{y}_0\)?
What is the distribution of \(y_0 = \beta_0 + \beta_1x_0 + \epsilon\)?
What is the distribution of \(y_0 - \hat{y}_0\)?
\[\frac{(y_0 - \hat{y}_0) - \color{red}{0}}{\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \overline{x})^2}{S_{xx}}}} \sim N(0, 1)\]
\[\frac{y_0 - \hat{y}_0}{s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \overline{x})^2}{S_{xx}}}} \sim t_{n-2}\]
The \((1-\alpha)100\%\) prediction interval (PI) for \(y_0(x_0)\) is \(\small \boxed{\hat{y_0} \pm t_{\alpha/2, n-2} s\sqrt{1+ \frac{1}{n} + \frac{(x_0 - \overline{x})^2}{S_{xx}}}}\)
What is the difference between CI for \(E(y\mid x_0)\) and PI for \(y_0(x_0)\)?
## CI for the mean response
predict(reg_fit, newdata = data.frame(displ = 5.5), interval = "confidence", level = 0.95)
fit lwr upr
1 16.3 15.4 17.2
## PI for the new observation
predict(reg_fit, newdata = data.frame(displ = 5.5), interval = "prediction", level = 0.95)
fit lwr upr
1 16.3 8.67 23.9
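The two intervals can be reproduced from the boxed formulas (a sketch with \(x_0 = 5.5\) as above):

## CI and PI at x0 = 5.5 from the formulas
x <- mpg$displ; n <- length(x); x0 <- 5.5
S_xx <- sum((x - mean(x)) ^ 2)
s <- summ_reg_fit$sigma
y0_hat <- sum(reg_fit$coefficients * c(1, x0))             ## b_0 + b_1 * x0
t_crit <- qt(0.975, df = n - 2)
se_mean <- s * sqrt(1 / n + (x0 - mean(x)) ^ 2 / S_xx)     ## for E(y | x0)
se_pred <- s * sqrt(1 + 1 / n + (x0 - mean(x)) ^ 2 / S_xx) ## for a new y0
y0_hat + c(-1, 1) * t_crit * se_mean    ## matches the CI above
y0_hat + c(-1, 1) * t_crit * se_pred    ## matches the PI above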
What is the Bernoulli distribution?
\[L(\theta) = \prod_{i = 1}^nf(y_i|\theta) = \prod_{i = 1}^n\theta^{y_i}(1-\theta)^{1-y_i} = \theta^{\sum_{i=1}^ny_i}(1-\theta)^{n-\sum_{i=1}^ny_i}\]
Suppose \(n = 20\) and \(\sum_{i=1}^{20}y_i = 12\) from the sample data.
First-order condition: \(\frac{d \, \log L(\theta)}{d \, \theta} = \frac{12}{\theta} - \frac{8}{1-\theta} = 0\).
\(\hat{\theta}_{ML} = 12/20 = 0.6\).
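The same maximizer can be found numerically (a sketch for the \(n = 20\), \(\sum y_i = 12\) example):

## numerically maximize the Bernoulli log-likelihood
log_lik <- function(theta) 12 * log(theta) + 8 * log(1 - theta)
optimize(log_lik, interval = c(0.001, 0.999), maximum = TRUE)$maximum
## close to the closed-form MLE 12/20 = 0.6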
\[\begin{align} \small \quad \quad \left.\frac{\partial \ell}{\partial\beta_0}\right\vert_{\tilde{\beta}_0, \tilde{\beta}_1, \tilde{\sigma}^2} & = 0\\ \small \quad \quad \left. \frac{\partial \ell}{\partial\beta_1}\right\vert_{\tilde{\beta}_0, \tilde{\beta}_1, \tilde{\sigma}^2} &= 0\\ \small \quad \quad \left. \frac{\partial \ell}{\partial\sigma^2}\right\vert_{\tilde{\beta}_0, \tilde{\beta}_1, \tilde{\sigma}^2} &= 0 \end{align}\]
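A minimal sketch that maximizes the log-likelihood numerically with optim() for the mpg fit (starting values are arbitrary assumptions); the ML estimates of \(\beta_0\) and \(\beta_1\) agree with the LS estimates, while \(\tilde{\sigma}^2 = SS_{res}/n\) (divisor \(n\), not \(n - 2\)):

## numerical ML fit of hwy ~ displ
neg_log_lik <- function(par) {
  beta_0 <- par[1]; beta_1 <- par[2]; sigma2 <- par[3]
  -sum(dnorm(mpg$hwy, mean = beta_0 + beta_1 * mpg$displ,
             sd = sqrt(sigma2), log = TRUE))
}
ml_fit <- optim(par = c(30, -3, 10), fn = neg_log_lik,
                method = "L-BFGS-B", lower = c(-Inf, -Inf, 1e-6))
ml_fit$par                                                ## (beta_0, beta_1, sigma^2)
c(reg_fit$coefficients, sum(reg_fit$residuals ^ 2) / nrow(mpg))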