MATH 4780 / MSSC 5780 Regression Analysis
With categorical variable Gender
Relate the effective life (hours) of a cutting tool \((y)\) used on a lathe to the lathe speed in revolutions per minute \((x_1)\), a numerical variable, and the type of cutting tool used \((x_2)\), a categorical variable.
\(y = \beta_0+\beta_1x_1+\beta_2x_2 + \epsilon\), where \(x_2 = 0\) for tool type A and \(x_2 = 1\) for tool type B
\(\hat{y} = b_0 + b_1 x_1 + b_2 x_2\)
In R, the categorical variable type should be stored as a character or a factor.
'data.frame': 20 obs. of 3 variables:
$ hours: num 18.7 14.5 17.4 14.5 13.4 ...
$ speed: num 610 950 720 840 980 530 680 540 890 730 ...
$ type : chr "A" "A" "A" "A" ...
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 36.986 | 3.5104 | 10.5 | 7.2e-09 |
speed | -0.027 | 0.0045 | -5.9 | 1.8e-05 |
typeB | 15.004 | 1.3597 | 11.0 | 3.6e-09 |
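As a worked reading of these estimates (using only the coefficients reported above), setting \(x_2 = 0\) and \(x_2 = 1\) in \(\hat{y} = b_0 + b_1 x_1 + b_2 x_2\) gives two parallel fitted lines:

\[
\begin{aligned}
\text{Type A } (x_2 = 0):\quad & \hat{y} = 36.986 - 0.027\,x_1 \\
\text{Type B } (x_2 = 1):\quad & \hat{y} = (36.986 + 15.004) - 0.027\,x_1 = 51.990 - 0.027\,x_1
\end{aligned}
\]

The slopes are equal by construction, and \(b_2 = 15.004\) is the estimated difference in mean tool life (hours) between type B and type A at any fixed lathe speed.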
Do the errors have the same variance for both tool types A and B?
The residuals are approximately normal.
If the model performs well, the single-model approach with dummy variables is preferred to fitting a separate regression for each tool type.
\(y = \beta_0+\beta_1x_1+\beta_2x_2 + \beta_3x_1x_2+\epsilon\)
\(\hat{y} = b_0 + b_1 x_1 + b_2x_2 + b_3 x_1 x_2\)
How do we test whether the two regression lines are identical? Test \(H_0: \beta_2 = \beta_3 = 0\) in the interaction model.
Call:
lm(formula = hours ~ speed + type + speed:type, data = tool_data)
Coefficients:
(Intercept) speed typeB speed:typeB
32.7748 -0.0210 23.9706 -0.0119
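Plugging these coefficients into \(\hat{y} = b_0 + b_1 x_1 + b_2x_2 + b_3 x_1 x_2\) shows that the interaction model allows both the intercept and the slope to differ by tool type:

\[
\begin{aligned}
\text{Type A } (x_2 = 0):\quad & \hat{y} = 32.7748 - 0.0210\,x_1 \\
\text{Type B } (x_2 = 1):\quad & \hat{y} = (32.7748 + 23.9706) + (-0.0210 - 0.0119)\,x_1 = 56.7454 - 0.0329\,x_1
\end{aligned}
\]

Here \(b_2\) is the difference in intercepts and \(b_3\) is the difference in slopes between type B and type A.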
\(y = \beta_0+\beta_1x_1+\beta_2x_2 + \epsilon\)
\(y = \beta_0+\beta_1x_1+\beta_2x_2 + \beta_3x_1x_2+\epsilon\)
Analysis of Variance Table
Model 1: hours ~ speed
Model 2: hours ~ speed * type
Model | Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|---|
1 | 18 | 1282 | | | | |
2 | 16 | 141 | 2 | 1141 | 64.8 | 2.1e-08 *** |
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\(F_{test} = \frac{SS_R(\beta_2, \beta_3 |\beta_1, \beta_0)/2}{MS_{res}} = \frac{1141/2}{141/16} = 64.75 > F_{\alpha, 2, 20-4}\)
The two regression lines are not identical.
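The arithmetic behind this partial F-test can be reproduced from the two residual sums of squares alone. The following is a minimal sketch, not the course's own code: `partial_f_test` is a hypothetical helper name, and the inputs are the rounded values printed in the ANOVA table, so the statistic comes out near 64.7 rather than matching the unrounded output exactly.

```r
# Partial F-test from two residual sums of squares.
# `partial_f_test` is a hypothetical helper name, not a base R function.
partial_f_test <- function(rss_reduced, df_reduced, rss_full, df_full,
                           alpha = 0.05) {
  df_num <- df_reduced - df_full   # number of coefficients set to 0 under H0
  f_stat <- ((rss_reduced - rss_full) / df_num) / (rss_full / df_full)
  list(
    f_stat  = f_stat,
    df      = c(df_num, df_full),
    f_crit  = qf(1 - alpha, df_num, df_full),   # upper-alpha critical value
    p_value = pf(f_stat, df_num, df_full, lower.tail = FALSE)
  )
}

# Rounded values read off the ANOVA table (Model 1 vs. Model 2)
res <- partial_f_test(rss_reduced = 1282, df_reduced = 18,
                      rss_full = 141, df_full = 16)
res$f_stat > res$f_crit   # TRUE means H0 is rejected at level alpha
```

With the fitted models available, `anova(reduced_fit, full_fit)` performs the same comparison directly; the helper only makes the formula explicit.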
Tool Type | \(x_2\) | \(x_3\) |
---|---|---|
A | 0 | 0 |
B | 1 | 0 |
C | 0 | 1 |
Type of Air Conditioning | \(x_2\) | \(x_3\) | \(x_4\) |
---|---|---|---|
No air conditioning | 0 | 0 | 0 |
Window units | 1 | 0 | 0 |
Heat pump | 0 | 1 | 0 |
Central air conditioning | 0 | 0 | 1 |
Which type is the baseline level?
The regression model is \(y = \beta_0+\beta_1x_1+\beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \epsilon\)
“No air conditioning” is the baseline level.
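The dummy coding in the table can be checked in R, which by default uses the first level of a factor as the baseline. This is a small sketch using only the level labels from the table; the object names `ac`, `dummies`, and `ac_central` are illustrative assumptions.

```r
# Level labels copied from the air-conditioning table; no real data involved.
ac_levels <- c("No air conditioning", "Window units",
               "Heat pump", "Central air conditioning")

# The first level listed becomes the baseline under treatment coding.
ac <- factor(ac_levels, levels = ac_levels)

# model.matrix() builds the design matrix; drop the intercept column
# to keep only the three dummy variables (x2, x3, x4 in the table).
dummies <- model.matrix(~ ac)[, -1]
rownames(dummies) <- ac_levels
dummies

# Choosing a different baseline level is done with relevel().
ac_central <- relevel(ac, ref = "Central air conditioning")
levels(ac_central)[1]
```

A categorical variable with 4 levels needs only 3 dummy variables; the baseline level is the row in which all dummies are 0.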
If the house has
no air conditioning: \[y = \beta_0+\beta_1x_1 + \epsilon\]
window units: \[y = (\beta_0 + \beta_2)+\beta_1x_1 + \epsilon\]
a heat pump: \[y = (\beta_0 + \beta_3) +\beta_1x_1 + \epsilon\]
central air conditioning: \[y = (\beta_0 + \beta_4) +\beta_1x_1 + \epsilon\]
Do you think the model \(y = \beta_0+\beta_1x_1+\beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \epsilon\) is reasonable?
\(y = \beta_0+\beta_1x_1+\beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \beta_5x_1x_2 + \beta_6 x_1x_3 + \beta_7 x_1x_4 + \epsilon\)
“No air conditioning” is the baseline level.
No air conditioning: \[y = \beta_0+\beta_1x_1 + \epsilon\]
Window units: \[y = (\beta_0 + \beta_2)+(\beta_1+\beta_5)x_1 + \epsilon\]
Heat pump: \[y = (\beta_0 + \beta_3) +(\beta_1+\beta_6)x_1 + \epsilon\]
Central air conditioning: \[y = (\beta_0 + \beta_4) +(\beta_1+\beta_7)x_1 + \epsilon\]
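❗ A model with two categorical variables, each having 2 categories (tool type \(x_2\) and cutting oil \(x_3\)), has the same expression as the model with only one categorical variable having 3 categories. But the meaning of the dummy variables is totally different!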
Tool Type | Cutting Oil | Regression Model |
---|---|---|
A \(\small (x_2 = 0)\) | Low-viscosity \(\small (x_3 = 0)\) | \(\small y = \beta_0+\beta_1x_1 + \epsilon\) |
B \(\small (x_2 = 1)\) | Low-viscosity \(\small (x_3 = 0)\) | \(\small y = (\beta_0+ \beta_2) + (\beta_1+\beta_4)x_1 + \epsilon\) |
A \(\small (x_2 = 0)\) | Medium-viscosity \(\small (x_3 = 1)\) | \(\small y = (\beta_0+ \beta_3) + (\beta_1+\beta_5)x_1 + \epsilon\) |
B \(\small (x_2 = 1)\) | Medium-viscosity \(\small (x_3 = 1)\) | \(\small y = (\beta_0+ \beta_2 + \beta_3) + (\beta_1+\beta_4 + \beta_5)x_1 + \epsilon\) |
Tool Type | Cutting Oil | Regression Model |
---|---|---|
A \(\small (x_2 = 0)\) | Low-viscosity \(\small (x_3 = 0)\) | \(\small y = \beta_0+\beta_1x_1 + \epsilon\) |
B \(\small (x_2 = 1)\) | Low-viscosity \(\small (x_3 = 0)\) | \(\small y = (\beta_0+ \beta_2) + (\beta_1+\beta_4)x_1 + \epsilon\) |
A \(\small (x_2 = 0)\) | Medium-viscosity \(\small (x_3 = 1)\) | \(\small y = (\beta_0+ \beta_3) + (\beta_1+\beta_5)x_1 + \epsilon\) |
B \(\small (x_2 = 1)\) | Medium-viscosity \(\small (x_3 = 1)\) | \(\small y = (\beta_0+ \beta_2 + \beta_3 + \beta_6) + (\beta_1+\beta_4 + \beta_5)x_1 + \epsilon\) |
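The two tables differ only in their last row. Reading the models back from the table entries (an inference from the rows, since the model equations are not written out here), the first table corresponds to

\[
y = \beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\beta_4x_1x_2+\beta_5x_1x_3+\epsilon
\]

and the second adds an interaction between the two categorical variables:

\[
y = \beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\beta_4x_1x_2+\beta_5x_1x_3+\beta_6x_2x_3+\epsilon
\]

Substituting \(x_2 = 1\) and \(x_3 = 1\) into the second model gives \(y = (\beta_0+\beta_2+\beta_3+\beta_6) + (\beta_1+\beta_4+\beta_5)x_1 + \epsilon\), so \(\beta_6\) shifts the intercept only when tool type B is used with medium-viscosity oil.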
All \(M\) slopes are identical: \(H_0: \beta_{11} = \beta_{12} = \cdots = \beta_{1M} = \beta_1\)
Full Model \((F)\): \(y = \beta_{0m} + \beta_{1m}x + \epsilon, \quad m = 1, 2, \dots, M\).
Reduced Model \((R)\): \(y = \beta_0 + \beta_1x + \color{blue}{\beta_2D_1 + \beta_3D_2 + \cdots + \beta_{M}D_{M-1}}+\epsilon\), where \(D_1, \dots, D_{M-1}\) are dummies.
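The resulting test statistic can be written out as follows. This is a sketch under the assumption that \(n\) denotes the total number of observations pooled over the \(M\) groups and that \(SS_{res}(F)\) and \(SS_{res}(R)\) denote the residual sums of squares of the full and reduced models. The full model has \(2M\) parameters and the reduced model has \(M + 1\), so \(H_0\) imposes \(M - 1\) restrictions:

\[
F_{test} = \frac{\left[SS_{res}(R) - SS_{res}(F)\right]/(M-1)}{SS_{res}(F)/(n-2M)}
\]

Reject \(H_0\) if \(F_{test} > F_{\alpha, M-1, n-2M}\). For the tool data, \(M = 2\) and \(n = 20\), so the test of equal slopes has 1 and 16 degrees of freedom.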