MATH 4780 - Fall 2023 - Categorical Variables

Categorical Variables

Examine the relationship between numerical response and categorical predictors.
- Gender (Female 👩, Male 👨, Other 🏳️‍🌈) : Gender income/wage gap
- Country (USA 🇺🇸, Canada 🇨🇦, UK 🇬🇧, Germany 🇩🇪, Japan 🇯🇵, Korea 🇰🇷) : Meat consumption level
- Political Party (Republican 🔴, Democratic 🔵, Other ⚫) : Donation to healthcare

Categorical Variable in Regression

With categorical variable Gender

Inappropriate height-weight relationship if gender factor is ignored.
The two groups have different \(\beta_0\) and \(\beta_1\).

Example 8.1 Tool Life Data (LRA)

Relate the effective life (hours) of a cutting tool \((y)\) used on a lathe to
- the lathe speed in revolutions per minute \((x_1)\) (numerical)
- the type of cutting tool used \((x_2)\) (categorical)

   hours speed type
1     19   610    A
2     15   950    A
3     17   720    A
4     15   840    A
5     13   980    A
6     24   530    A
7     13   680    A
8     23   540    A
9     13   890    A
10    19   730    A
11    30   670    B
12    27   770    B
13    25   880    B
14    26  1000    B
15    33   760    B
16    36   590    B
17    26   910    B
18    37   650    B
19    35   810    B
20    44   500    B

Indicator Variable

Tool type can be represented as: \[x_2 = \begin{cases} 0 & \quad \text{Tool type A}\\ 1 & \quad \text{Tool type B} \end{cases}\] where \(x_2\) is a dummy variable.
If a first-order model is appropriate: \[y = \beta_0+\beta_1x_1+\beta_2x_2 + \epsilon, \quad \epsilon \sim N(0, \sigma^2)\]
Assume that the variance is the same for both levels (type A and B).

When we label the data points by tool type, we can clearly see that at the same speed, type B tends to have a longer life than type A.
So if we ignore the tool type and predict tool life from speed alone, the prediction error will be fairly large.
The fitted line falls between the two types, so it underestimates the life of type B tools and overestimates the life of type A tools.
To model the effect of a categorical variable, we use a dummy variable, also called an indicator variable.
Here we consider two types of cutting tool, A and B.
\(x_2 = \begin{cases} 0 & \quad \text{Tool type A}\\ 1 & \quad \text{Tool type B} \end{cases}\)
If the categorical variable has two categories, we need one dummy variable.
In general, if a categorical variable has \(k\) categories, we need \(k-1\) dummy variables in the regression model.
First-order means the model contains no interaction terms and no higher-order polynomial terms.
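
The \(k-1\) rule can be seen directly in the design matrix R builds. The sketch below uses two hypothetical toy factors (`two_level` and `three_level` are names made up here, not part of the tool life data) and `model.matrix()` to show the dummy columns that `lm()` would use.

```r
# Hypothetical toy factors, used only to show how R codes dummy variables.
two_level   <- factor(c("A", "B", "A", "B"))
three_level <- factor(c("A", "B", "C", "A"))

# model.matrix() returns the design matrix that lm() would build.
X2 <- model.matrix(~ two_level)
X3 <- model.matrix(~ three_level)

colnames(X2)  # intercept plus 1 dummy column (2 levels)
colnames(X3)  # intercept plus 2 dummy columns (3 levels)

# The dummy column is 1 for level B and 0 for the baseline level A.
X2[, "two_levelB"]
```

The first level of each factor gets no column of its own: it is absorbed into the intercept and serves as the baseline.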

Interpretation of Coefficients

For Tool type A \((x_2 = 0)\) the model becomes: \[\begin{align} y &= \beta_0+\beta_1x_1+\beta_2(0) + \epsilon \\ &= \beta_0+\beta_1x_1+ \epsilon \end{align}\]
For Tool type B \((x_2 = 1)\) the model becomes: \[\begin{align} y &= \beta_0+\beta_1x_1+\beta_2(1) + \epsilon \\ &= (\beta_0 + \beta_2)+\beta_1x_1+ \epsilon \end{align}\]
Changing from type A to B changes the intercept by \(\beta_2\), but the slope \(\beta_1\) is the same for both types.
Type A is the baseline level.

Parallel Regression Lines

Two parallel regression lines with a common slope \(\beta_1\) and different intercepts.
\(\beta_2\) measures the difference in mean tool life resulting from changing from tool type A to B.

\(y = \beta_0+\beta_1x_1+\beta_2x_2 + \epsilon\)

\(\hat{y} = b_0 + b_1 x_1 + b_2 x_2\)

R Lab Model Fitting

The categorical variable should be of type character or factor.

str(tool_data)

'data.frame':   20 obs. of  3 variables:
 $ hours: num  18.7 14.5 17.4 14.5 13.4 ...
 $ speed: num  610 950 720 840 980 530 680 540 890 730 ...
 $ type : chr  "A" "A" "A" "A" ...

full_model <- lm(hours ~ speed + type, data = tool_data)
summary(full_model)$coef

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   36.986     3.5104    10.5  7.2e-09
speed         -0.027     0.0045    -5.9  1.8e-05
typeB         15.004     1.3597    11.0  3.6e-09

\(\hat{y} = 37 -0.027x_1 +15x_2\)
At the same lathe speed, type B tools are expected to last about 15 hours longer, on average, than type A tools (the baseline).
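
As a rough worked example, take a hypothetical speed of \(x_1 = 700\) rpm (a value chosen here for illustration, inside the observed range of speeds) and plug it into the fitted equation, using the rounded coefficients from the table above: \[\begin{align} \text{Type A: } \hat{y} &= 36.986 - 0.027(700) = 18.086 \\ \text{Type B: } \hat{y} &= 36.986 - 0.027(700) + 15.004 = 33.090 \end{align}\]
The gap between the two predictions is 15.004 hours at any speed, because the two fitted lines are parallel.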

R creates the dummy variable from the factor levels that are set when the categorical variable is converted to a factor.
By default, the factor levels are in alphabetical order, and the first level is the baseline.
The order can be specified with the levels argument of factor().
If the factor already exists, relevel() changes its baseline level.

  B
A 0
B 1
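
A minimal sketch of how the baseline can be inspected and changed, assuming a short hypothetical vector `type` of tool types (not the full data set). `contrasts()` shows the dummy coding R would use for each version of the factor.

```r
# Hypothetical short vector of tool types, for illustration only.
type <- c("A", "A", "B", "B")

f <- factor(type)      # default: levels in alphabetical order
levels(f)              # first level is the baseline
contrasts(f)           # dummy coding with A as baseline

# Option 1: change the baseline of an existing factor.
f_b <- relevel(f, ref = "B")
contrasts(f_b)         # now the dummy column indicates type A

# Option 2: set the level order when the factor is created.
f_lv <- factor(type, levels = c("B", "A"))
levels(f_lv)
```

With B as the baseline, the fitted coefficient of the dummy would measure the change from B to A, so its sign flips relative to the `typeB` coefficient.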

Model Checking

Same variance of the errors for both A and B?

Approximately normal
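
The two checks can also be done numerically. The sketch below is an illustration under a loud assumption: `tool_data` is rebuilt from the rounded hours printed in the data listing, so the numbers will differ slightly from the slides. It compares the residual spread within each tool type and runs a Shapiro-Wilk test as a rough companion to a normal QQ plot.

```r
# ASSUMPTION: hours are the rounded values printed in the data listing,
# not the unrounded values used for the fits shown in the slides.
tool_data <- data.frame(
  hours = c(19, 15, 17, 15, 13, 24, 13, 23, 13, 19,
            30, 27, 25, 26, 33, 36, 26, 37, 35, 44),
  speed = c(610, 950, 720, 840, 980, 530, 680, 540, 890, 730,
            670, 770, 880, 1000, 760, 590, 910, 650, 810, 500),
  type  = rep(c("A", "B"), each = 10)
)

fit <- lm(hours ~ speed + type, data = tool_data)
res <- resid(fit)

# Residual standard deviation within each tool type:
# similar values support the common-variance assumption.
res_sd <- tapply(res, tool_data$type, sd)
res_sd

# Shapiro-Wilk test of normality on the residuals.
sw <- shapiro.test(res)
sw$p.value
```

For the graphical versions, `plot(fitted(fit), res)` and `qqnorm(res); qqline(res)` draw the residual plot and the normal QQ plot.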

Single Model vs. Separate Models

Two separate models, one for each type, could have been fit to the data. \[y^A = \beta_0^A+\beta_1x_1^A+ \epsilon^A, \quad \epsilon^A \sim N(0, \sigma^2)\] \[y^B = \beta_0^B+\beta_1x_1^B+ \epsilon^B, \quad \epsilon^B \sim N(0, \sigma^2)\]

If it performs well, the single-model approach with dummy variables is preferred.

Only one equation to work with, \(y = \beta_0+\beta_1x_1+\beta_2x_2 + \epsilon\), a simpler practical result.

Both lines are assumed to have the same slope \(\beta_1\) and error variance \(\sigma^2\).
- Combine the data to produce a single estimate of the common parameters.
- Each common parameter is estimated from all of the data, so the estimates are more precise.

Difference in Slope

If we expect the slopes to differ, include an interaction term between the variables: \[y = \beta_0+\beta_1x_1+\beta_2x_2 + \color{blue}{\beta_3x_1x_2} + \epsilon\]

Tool type A \((x_2 = 0)\): \[\begin{align} y &= \beta_0+\beta_1x_1+\beta_2(0) + \beta_3x_1(0) + \epsilon \\ &= \beta_0+\beta_1x_1+ \epsilon \end{align}\]
Tool type B \((x_2 = 1)\): \[\begin{align} y &= \beta_0+\beta_1x_1+\beta_2(1) + \beta_3x_1(1) + \epsilon \\ &= (\beta_0+ \beta_2) + (\beta_1+\beta_3)x_1+ \epsilon \end{align}\]
- \(\beta_2\) is the change in the intercept caused by changing from type A to type B.
- \(\beta_3\) is the change in the slope caused by changing from type A to type B.

Response Function for the Tool Life Example

\(y = \beta_0+\beta_1x_1+\beta_2x_2 + \beta_3x_1x_2 + \epsilon\) defines two regression lines with different slopes and intercepts.

\(y = \beta_0+\beta_1x_1+\beta_2x_2 + \beta_3x_1x_2+\epsilon\)

\(\hat{y} = b_0 + b_1 x_1 + b_2x_2 + b_3 x_1 x_2\)

Two Models

The model \(y = \beta_0+\beta_1x_1+\beta_2x_2 + \beta_3x_1x_2 + \epsilon\) is equivalent to fitting two separate regressions:
- \(y = \beta_0+\beta_1x_1+ \epsilon\)
- \(y = \alpha_0 + \alpha_1x_1+ \epsilon\), \(\quad \alpha_0 = \beta_0 + \beta_2\), \(\quad \alpha_1 = \beta_1 + \beta_3\).
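
The equivalence of the coefficient estimates can be checked numerically. This is a sketch under a stated assumption: `tool_data` is rebuilt from the rounded hours printed in the data listing, so the estimates differ slightly from the R Lab output. The names `fit_A`, `fit_B`, `line_A`, and `line_B` are introduced here for illustration.

```r
# ASSUMPTION: hours are the rounded values printed in the data listing.
tool_data <- data.frame(
  hours = c(19, 15, 17, 15, 13, 24, 13, 23, 13, 19,
            30, 27, 25, 26, 33, 36, 26, 37, 35, 44),
  speed = c(610, 950, 720, 840, 980, 530, 680, 540, 890, 730,
            670, 770, 880, 1000, 760, 590, 910, 650, 810, 500),
  type  = rep(c("A", "B"), each = 10)
)

# Single model with interaction.
inter <- lm(hours ~ speed * type, data = tool_data)
b <- coef(inter)

# Two separate regressions, one per tool type.
fit_A <- lm(hours ~ speed, data = tool_data, subset = type == "A")
fit_B <- lm(hours ~ speed, data = tool_data, subset = type == "B")

# Intercept and slope implied by the interaction model for each type.
line_A <- c(b["(Intercept)"], b["speed"])
line_B <- c(b["(Intercept)"] + b["typeB"], b["speed"] + b["speed:typeB"])

all.equal(unname(line_A), unname(coef(fit_A)))
all.equal(unname(line_B), unname(coef(fit_B)))
```

The fitted lines agree, but the standard errors generally do not: the single model pools both groups into one estimate of \(\sigma^2\), while each separate fit estimates its own.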

How do we test if the 2 regressions are identical?

We can use the extra sum of squares method, comparing the full model with the reduced model that drops the tool type terms.
\(H_0: \beta_2 = \beta_3 = 0 \quad H_1: \beta_2 \ne 0 \text{ or } \beta_3 \ne 0\)

R Lab Regression Model with Interaction

(full_model <- lm(hours ~ speed + type + speed:type, data = tool_data))


Call:
lm(formula = hours ~ speed + type + speed:type, data = tool_data)

Coefficients:
(Intercept)        speed        typeB  speed:typeB  
    32.7748      -0.0210      23.9706      -0.0119

(full_model <- lm(hours ~ speed*type, data = tool_data))


Call:
lm(formula = hours ~ speed * type, data = tool_data)

Coefficients:
(Intercept)        speed        typeB  speed:typeB  
    32.7748      -0.0210      23.9706      -0.0119

The fitted model is \[\hat{y} = 32.77-0.021x_1+23.97x_2 -0.012x_1x_2\]
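
Setting \(x_2 = 0\) and \(x_2 = 1\) in the fitted model, with the coefficients as printed in the R output above, gives the two fitted lines: \[\begin{align} \text{Type A } (x_2 = 0): \quad \hat{y} &= 32.7748 - 0.0210x_1 \\ \text{Type B } (x_2 = 1): \quad \hat{y} &= (32.7748 + 23.9706) + (-0.0210 - 0.0119)x_1 \\ &= 56.7454 - 0.0329x_1 \end{align}\]
Type B starts higher but loses life faster as speed increases, so the gap between the two types narrows at higher speeds.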

R Lab Regression Lines

\(y = \beta_0+\beta_1x_1+\beta_2x_2 + \epsilon\)

\(y = \beta_0+\beta_1x_1+\beta_2x_2 + \beta_3x_1x_2+\epsilon\)

R Lab Test Effect of Tool Type

\(H_0: \beta_2 = \beta_3 = 0 \quad H_1: \beta_2 \ne 0 \text{ or } \beta_3 \ne 0\)

reduced_model <- lm(hours ~ speed, data = tool_data)
anova(reduced_model, full_model)

Analysis of Variance Table

Model 1: hours ~ speed
Model 2: hours ~ speed * type
  Res.Df  RSS Df Sum of Sq    F  Pr(>F)    
1     18 1282                              
2     16  141  2      1141 64.8 2.1e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\(F_{test} = \frac{SS_R(\beta_2, \beta_3 |\beta_1, \beta_0)/2}{MS_{res}} = \frac{1141/2}{141/16} \approx 64.75 > F_{\alpha, 2, 20-4}\)
We reject \(H_0\): the two regression lines are not identical.
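
The F statistic can be rebuilt by hand from the two residual sums of squares. The sketch below makes two assumptions that are not in the slides: `tool_data` is rebuilt from the rounded hours printed in the data listing (so the value differs slightly from 64.75), and the significance level is taken to be \(\alpha = 0.05\).

```r
# ASSUMPTION: hours are the rounded values printed in the data listing.
tool_data <- data.frame(
  hours = c(19, 15, 17, 15, 13, 24, 13, 23, 13, 19,
            30, 27, 25, 26, 33, 36, 26, 37, 35, 44),
  speed = c(610, 950, 720, 840, 980, 530, 680, 540, 890, 730,
            670, 770, 880, 1000, 760, 590, 910, 650, 810, 500),
  type  = rep(c("A", "B"), each = 10)
)

reduced <- lm(hours ~ speed, data = tool_data)
full    <- lm(hours ~ speed * type, data = tool_data)

rss_r <- sum(resid(reduced)^2)   # residual SS, reduced model
rss_f <- sum(resid(full)^2)      # residual SS, full model

df_num <- df.residual(reduced) - df.residual(full)  # number of tested terms
df_den <- df.residual(full)                         # n minus 4 parameters

# Extra sum of squares F statistic.
F_stat <- ((rss_r - rss_f) / df_num) / (rss_f / df_den)

# ASSUMPTION: alpha = 0.05 for the critical value.
F_crit <- qf(0.95, df_num, df_den)
p_val  <- pf(F_stat, df_num, df_den, lower.tail = FALSE)

c(F_stat = F_stat, F_crit = F_crit, p_val = p_val)
```

The hand-computed `F_stat` matches the `F` column of `anova(reduced, full)`.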

More than 2 Categories

For a categorical predictor with \(m\) levels, we need \(m-1\) dummies.
Three tool types, A, B, and C. Then two indicators \(x_2\) and \(x_3\) will be needed:

Tool Type	\(x_2\)	\(x_3\)
A	0	0
B	1	0
C	0	1

The regression model (common slope) is \[y = \beta_0+\beta_1x_1+\beta_2x_2 + \beta_3x_3 + \epsilon\]
Type A is the baseline level.

Example 8.3: More Than 2 Levels (LRA)

An electric utility is investigating the effect of the size of a single family house \((x_1)\) and the type of air conditioning used on the total electricity consumption \((y)\).

Type of Air Conditioning	\(x_2\)	\(x_3\)	\(x_4\)
No air conditioning	0	0	0
Window units	1	0	0
Heat pump	0	1	0
Central air conditioning	0	0	1

Which type is the baseline level?

Example 8.3: Dummy Variables

The regression model is \(y = \beta_0+\beta_1x_1+\beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \epsilon\)

“No air conditioning” is the baseline level.

Type of Air Conditioning	\(x_2\)	\(x_3\)	\(x_4\)
No air conditioning	0	0	0
Window units	1	0	0
Heat pump	0	1	0
Central air conditioning	0	0	1
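
In R, this coding needs the levels to be set explicitly, because alphabetical order would make a different type the baseline. A minimal sketch, assuming one hypothetical house per type (the vector `ac_levels` and factor `ac` are names introduced here):

```r
# Level order as in the table, with the intended baseline first.
ac_levels <- c("No air conditioning", "Window units",
               "Heat pump", "Central air conditioning")

# Hypothetical data: one house for each type of air conditioning.
ac <- factor(ac_levels, levels = ac_levels)

# Design matrix: intercept plus three dummy columns (x2, x3, x4).
X <- model.matrix(~ ac)
X

# Without the levels argument, the baseline is the first level
# in alphabetical order instead.
levels(factor(ac_levels))[1]
```

The rows of `X` after the intercept column reproduce the table: all zeros for no air conditioning, and a single 1 in the matching column for each of the other types.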

If the house has

no air conditioning,

\[y = \beta_0+\beta_1x_1 + \epsilon\]

window units,

\[y = (\beta_0 + \beta_2)+\beta_1x_1 + \epsilon\]

a heat pump,

\[y = (\beta_0 + \beta_3) +\beta_1x_1 + \epsilon\]

central air conditioning,

\[y = (\beta_0 + \beta_4) +\beta_1x_1 + \epsilon\]

The type “No air conditioning” is the baseline level.
The model assumes that the relationship between electricity consumption and the size of the house is linear and the slope does not depend on the type of air conditioning system employed.
The parameters \(\beta_2\), \(\beta_3\), and \(\beta_4\) shift the height (intercept) of the regression line for the different types of air conditioning systems.
That is, \(\beta_2\), \(\beta_3\), and \(\beta_4\) measure the effect of window units, a heat pump, and a central air conditioning system, respectively, compared to no air conditioning.
Furthermore, other effects can be determined by directly comparing the appropriate regression coefficients. For example, \(\beta_3 - \beta_4\) reflects the relative efficiency of a heat pump compared to central air conditioning.
A negative value means the heat pump uses less electricity than central air conditioning for a house of the same size.
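
The comparison of two non-baseline types follows from subtracting their mean functions at the same house size \(x_1\): \[\begin{align} E(y \mid x_1, \text{heat pump}) - E(y \mid x_1, \text{central}) &= [(\beta_0 + \beta_3) + \beta_1x_1] - [(\beta_0 + \beta_4) + \beta_1x_1] \\ &= \beta_3 - \beta_4 \end{align}\]
The baseline intercept \(\beta_0\) and the common slope term \(\beta_1x_1\) cancel, so the difference does not depend on the house size in this model.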

Example 8.3: Interaction

Do you think the model \(y = \beta_0+\beta_1x_1+\beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \epsilon\) is reasonable?

It seems unrealistic to assume that the slope \(\beta_1\) relating mean electricity consumption to the house size does NOT depend on air-conditioning type.
The consumption increases with the house size.
The rate of increase should be different because a more efficient central air conditioning system should have a consumption rate lower than window units.
Add interaction between the house size and the type of air conditioning: \[y = \beta_0+\beta_1x_1+\beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \beta_5x_1x_2 + \beta_6 x_1x_3 + \beta_7 x_1x_4 + \epsilon\]

It would seem unrealistic to assume that the slope of the regression function relating mean electricity consumption to the size of the house does not depend on the type of air conditioning system.
We would expect the mean electricity consumption to increase with the size of the house, but the rate of increase should be different for a central air conditioning system than for window units because central air conditioning should be more efficient than window units for larger houses.
There should be an interaction between the size of the house and the type of air conditioning system: \[y = \beta_0+\beta_1x_1+\beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \beta_5x_1x_2 + \beta_6 x_1x_3 + \beta_7 x_1x_4 + \epsilon\]
The assumption that the variance of energy consumption does not depend on the type of air conditioning system used may be inappropriate.

Example 8.3: Unique Slope and Intercept

\(y = \beta_0+\beta_1x_1+\beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \beta_5x_1x_2 + \beta_6 x_1x_3 + \beta_7 x_1x_4 + \epsilon\)
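
Setting the dummy variables for each type in the interaction model gives four lines, each with its own intercept and slope: \[\begin{align} \text{No air conditioning: } \quad y &= \beta_0 + \beta_1x_1 + \epsilon \\ \text{Window units: } \quad y &= (\beta_0 + \beta_2) + (\beta_1 + \beta_5)x_1 + \epsilon \\ \text{Heat pump: } \quad y &= (\beta_0 + \beta_3) + (\beta_1 + \beta_6)x_1 + \epsilon \\ \text{Central air conditioning: } \quad y &= (\beta_0 + \beta_4) + (\beta_1 + \beta_7)x_1 + \epsilon \end{align}\]
Here \(\beta_5\), \(\beta_6\), and \(\beta_7\) are the changes in slope relative to the baseline, just as \(\beta_2\), \(\beta_3\), and \(\beta_4\) are the changes in intercept.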