Collinearity

MATH 4780 / MSSC 5780 Regression Analysis

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

Collinearity

Meaning

Sources

Effects

Diagnostics

Solutions

What is Collinearity

  • Collinearity refers to the situation in which two or more predictors are closely related to one another.
  • limit and age appear to have no obvious relationship, which is good!
  • limit and rating are highly correlated, and they are said to be collinear.
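A quick check of these claims in R (a minimal sketch; it assumes the plots refer to the Credit data in the ISLR2 package, where the column names are capitalized):

library(ISLR2)
## pairwise correlations among the three predictors discussed above
cor(Credit[, c("Limit", "Rating", "Age")])
## Limit and Rating are nearly perfectly correlated; Limit and Age are not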

Sources of Collinearity

Four primary sources

  • The data collection method employed
  • Constraints on the model or in the population
  • Model specification
  • A model with \(p>n\)

Data Collection

  • Collinearity occurs when only a subspace of the entire sample space has been explored.
  • We may be able to reduce this collinearity by changing the sampling plan, e.g., collecting data in the unexplored region.
  • Unlike a constraint, there is no physical reason why we cannot sample in that region.

Constraints

  • When physical constraints are present in the model or the population, the collinearity will exist regardless of the data collection method.

Model Specification

  • Polynomial terms can cause ill-conditioning in \({\bf X'X}\).
  • As the order of the model increases, inverting \({\bf X'X}\) becomes numerically inaccurate, and error can be introduced into the parameter estimates.
  • If the range of a regressor variable is small, adding an \(x^2\) term can result in significant collinearity, as illustrated in the sketch below.
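A minimal sketch (simulated data, not from the lecture) of the last point: when the range of \(x\) is narrow, \(x\) and \(x^2\) are nearly collinear, and centering \(x\) before squaring alleviates the problem.

set.seed(4780)
x <- runif(50, min = 10, max = 11)  ## regressor with a narrow range
cor(x, x^2)                         ## very close to 1: x and x^2 nearly collinear
xc <- x - mean(x)                   ## center x first
cor(xc, xc^2)                       ## much smaller in magnitude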

\(p>n\) Model

  • More regressor variables than observations.
  • The best way to counter this is to remove/reconstruct regressor variables.
    • Principal Component Regression
    • Variable Selection (Next Topic)

Effect of Collinearity

  1. 👉 Large variances and covariances for the LSEs \(b_j\).
  2. 👉 Tends to produce LSEs \(b_j\) that are too large in absolute value, so the vector \({\bf b}\) is, on average, much longer than the vector \(\boldsymbol \beta\) (see the simulation sketch below).
  • Large variances and large coefficient magnitudes lead to unstable and wrongly signed coefficients.
  • Poor coefficient estimates do not necessarily imply a bad fit or poor prediction.
  • Predictions should be confined to the region of the \(x\) space where the collinearity holds approximately.
  • Collinearity causes very poor extrapolated predictions.
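A small simulation (a sketch, not part of the original slides) illustrating points 1 and 2: with two nearly identical regressors, the LSE vector \({\bf b}\) varies wildly from sample to sample and is, on average, far longer than \(\boldsymbol \beta\).

set.seed(4780)
b_len <- replicate(2000, {
  x1 <- rnorm(30)
  x2 <- x1 + rnorm(30, sd = 0.05)    ## x2 is nearly a copy of x1
  y  <- 2 * x1 + 2 * x2 + rnorm(30)  ## true beta = (2, 2), no intercept
  b  <- coef(lm(y ~ x1 + x2 - 1))
  sqrt(sum(b^2))                     ## length of the LSE vector
})
mean(b_len)      ## much larger than ...
sqrt(2^2 + 2^2)  ## ... the true length ||beta||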

Perfectly Correlated Regressors

  • Suppose the true population regression equation is \(y = 3 + 4x\).

  • Suppose we try estimating that equation using the perfectly correlated regressors \(x\) and \(z = x/10\).

\[ \begin{aligned}\hat{y}&= \hat{\beta}_0 + \hat{\beta}_1x + \hat{\beta}_2z\\ &= \hat{\beta}_0 + \hat{\beta}_1x + \hat{\beta}_2\frac{x}{10}\\ &= \hat{\beta}_0 + \bigg(\hat{\beta}_1 + \frac{\hat{\beta}_2}{10}\bigg)x \end{aligned} \]

  • Can set \(\hat{\beta}_1\) and \(\hat{\beta}_2\) to any two numbers such that \(\hat{\beta}_1 + \frac{\hat{\beta}_2}{10} = 4\).

  • Unable to choose the “best” combination of \(\hat{\beta}_1\) and \(\hat{\beta}_2\).
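In R, lm() detects an exact linear dependence and reports the coefficient of the redundant regressor as NA (aliased); a minimal sketch:

set.seed(4780)
x <- rnorm(30)
z <- x / 10                 ## perfectly correlated with x
y <- 3 + 4 * x + rnorm(30)
coef(lm(y ~ x + z))         ## the coefficient of z is NA: it is not estimable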

Collinearity Diagnostics

Ideal characteristics of a collinearity diagnostic:

  • Correctly indicate whether collinearity is present
  • Indicate how severe the problem is
  • Provide insight into which regressors are causing the problem

Examination of the Correlation Matrix of \(x\)s

  • After unit length scaling, \({\bf X'X} = \left[r_{ij}\right]_{k\times k}\) is the correlation matrix of the \(x\)s, denoted \({\bf \Sigma}\). For example, \[{\bf X'X} = \begin{bmatrix} 1 & 0.992 \\ 0.992 & 1 \end{bmatrix}\]

  • \(r_{ij}\) is the pairwise correlation between \(x_i\) and \(x_j\).

  • Large \(|r_{ij}|\) is an indication of collinearity.

  • When more than two regressors are involved in collinearity, there may be instances when collinearity is present, but the pairwise correlations are not large.

  • Inspecting \(r_{ij}\) is not sufficient for detecting more complex collinearity.
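The unit_len_scale helper used later in the R lab is not shown in the slides; a minimal sketch of it, together with a check on simulated data that \({\bf X'X}\) after scaling is exactly the correlation matrix:

## center each column and divide by the square root of its corrected sum of squares
unit_len_scale <- function(x) (x - mean(x)) / sqrt(sum((x - mean(x))^2))
set.seed(4780)
X_sim <- cbind(x1 = rnorm(20), x2 = rnorm(20), x3 = rnorm(20))
X_s <- apply(X_sim, 2, unit_len_scale)
all.equal(t(X_s) %*% X_s, cor(X_sim))  ## TRUE: X'X equals the correlation matrix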

Variance Inflation Factors

  • The diagonal elements of \({\bf C} = {\bf \Sigma}^{-1}\), the inverse of the correlation matrix of the \(x\)s, are called variance inflation factors: \[\text{VIF}_j = {\bf C}_{jj}\]

  • Example: \[{\bf \Sigma} = \begin{bmatrix} 1 & 0.992 \\ 0.992 & 1 \end{bmatrix}; \quad {\bf \Sigma}^{-1} = \begin{bmatrix} 62.8 & -62.2 \\ -62.2 & 62.8 \end{bmatrix}\] and \(\text{VIF}_1 = 62.8\).

  • The collinearity inflates the variances of the estimated coefficients by more than 60-fold compared with the ideal case in which the two regressors are orthogonal (VIF = 1).
  • VIFs \(> 10\) are generally considered a sign of serious collinearity.
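A quick numerical check of the \(2 \times 2\) example above (a sketch):

Sig2 <- matrix(c(1, 0.992, 0.992, 1), nrow = 2)
solve(Sig2)         ## diagonal elements are the VIFs, about 62.8
1 / (1 - 0.992^2)   ## same value: with two regressors, VIF = 1 / (1 - r^2)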

Variance Inflation Factors

\[\text{VIF}_j = \frac{1}{1 - R^2_{X_j | X_{-j}}}\] where \(R^2_{X_j | X_{-j}}\) is the coefficient of determination obtained when \(x_j\) is regressed on the other regressors \(x_i, i \ne j\).

\[\mathrm{Var}(b_j) = \frac{s^2}{\sum_{i=1}^n(x_{ij} - \bar{x}_j)^2} \times \text{VIF}_j\]

  • \(\text{VIF}_j\) measures the combined effect of the dependencies among the regressors on the variance of \(b_j\).

  • If \(x_j\) is nearly linearly dependent on some subset of the remaining regressors, \(\text{VIF}_j = {\bf C}_{jj}\) is large.

Remember what linear (in)dependence means?
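A minimal sketch (simulated data) verifying that the two definitions of \(\text{VIF}_j\) agree: regressing \(x_1\) on the other regressors and computing \(1/(1 - R^2)\) matches the first diagonal element of the inverse correlation matrix.

set.seed(4780)
x1 <- rnorm(40); x2 <- x1 + rnorm(40, sd = 0.3); x3 <- rnorm(40)
Xd <- data.frame(x1, x2, x3)
R2_1 <- summary(lm(x1 ~ x2 + x3, data = Xd))$r.squared
1 / (1 - R2_1)            ## VIF_1 from the R^2 definition
diag(solve(cor(Xd)))[1]   ## VIF_1 from the inverse correlation matrix; same value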

R Lab Hospital Manpower Data

manpower
       y  x1    x2    x3    x4   x5
1    567  16  2463   473  18.0  4.5
2    697  44  2048  1340   9.5  6.9
3   1033  20  3940   620  12.8  4.3
4   1604  19  6505   568  36.7  3.9
5   1611  49  5723  1498  35.7  5.5
6   1613  45 11520  1366  24.0  4.6
7   1854  55  5779  1687  43.3  5.6
8   2161  59  5969  1640  46.7  5.2
9   2306  94  8461  2872  78.7  6.2
10  3504 128 20106  3655 180.5  6.2
11  3572  96 13313  2912  60.9  5.9
12  3741 131 10771  3921 103.7  4.9
13  4027 127 15543  3866 126.8  5.5
14 10344 253 36194  7684 157.7  7.0
15 11732 409 34703 12446 169.4 10.8
16 15415 464 39204 14098 331.4  7.0
17 18854 510 86533 15524 371.6  6.3

  • \(y\): Monthly man-hours
  • \(x_1\): Average daily patient load
  • \(x_2\): Monthly X-ray exposures
  • \(x_3\): Monthly occupied bed days
  • \(x_4\): Eligible population in the area / 1000
  • \(x_5\): Average length of patients’ stay in days

Do you expect to see a positive or negative relationship between \(y\) and each \(x_i\)?

R Lab Hospital Manpower - Pairwise Dependence
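The scatterplot matrix shown in class is not reproduced here; a sketch of how to generate it and the pairwise correlations:

pairs(manpower, pch = 19, col = 4)  ## scatterplot matrix of y and x1-x5
round(cor(manpower), 2)             ## pairwise correlations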

R Lab Hospital Manpower - Model Fit

lm_full <- lm(y ~ ., data = manpower)
(summ_full <- summary(lm_full))
...
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) 1962.9482  1071.3617    1.83    0.094 .
x1           -15.8517    97.6530   -0.16    0.874  
x2             0.0559     0.0213    2.63    0.023 *
x3             1.5896     3.0921    0.51    0.617  
x4            -4.2187     7.1766   -0.59    0.569  
x5          -394.3141   209.6395   -1.88    0.087 .
Residual standard error: 642 on 11 degrees of freedom
Multiple R-squared:  0.991, Adjusted R-squared:  0.987 
...

Excellent fit. But are there any issues with this fitted result?

  • The coefficients \(b_1\), \(b_4\) and \(b_5\) are negative.
  • In the case of \(x_1\), an increase in patient load, with the other \(x\)s held constant, corresponds to a decrease in hospital manpower (a wrong sign caused by the large variance).

R Lab Hospital Manpower - VIF

X <- manpower[, -1]
(Sig <- cor(X))
     x1   x2   x3   x4   x5
x1 1.00 0.91 1.00 0.94 0.67
x2 0.91 1.00 0.91 0.91 0.45
x3 1.00 0.91 1.00 0.93 0.67
x4 0.94 0.91 0.93 1.00 0.46
x5 0.67 0.45 0.67 0.46 1.00
(C <- solve(Sig))
      x1    x2    x3     x4    x5
x1  9598  11.9 -9247 -318.8 -93.9
x2    12   7.9   -18   -1.9   1.8
x3 -9247 -18.5  8933  294.4  83.4
x4  -319  -1.9   294   23.3   6.4
x5   -94   1.8    83    6.4   4.3
## VIF 
diag(C)
    x1     x2     x3     x4     x5 
9597.6    7.9 8933.1   23.3    4.3 
## put the fitted model in vif()
(vif_all <- car::vif(lm_full))
    x1     x2     x3     x4     x5 
9597.6    7.9 8933.1   23.3    4.3 

\(x_1\) Average daily patient load and \(x_3\) Monthly occupied bed days are highly correlated.

R Lab Hospital Manpower - Confidence Interval

  • Marginally, \(b_1\) and \(b_3\) vary a lot. The CI for \(\beta_1\) and CI for \(\beta_3\) both contain zero.
    2.5 % 97.5 %
x1 -230.8  199.1
x3   -5.2    8.4

  • But we are quite confident that \(\beta_1\) and \(\beta_3\) cannot both be zero; see the joint test sketched below.
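The intervals above come from confint(); the joint claim can be checked with an \(F\)-test of \(\beta_1 = \beta_3 = 0\) (a sketch using car::linearHypothesis, which, per the claim above, should reject; a joint confidence ellipse tells the same story):

confint(lm_full)[c("x1", "x3"), ]                        ## marginal CIs shown above
car::linearHypothesis(lm_full, c("x1 = 0", "x3 = 0"))    ## joint F-test of beta1 = beta3 = 0
## car::confidenceEllipse(lm_full, which.coef = c(2, 4)) ## joint confidence region for (beta1, beta3)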

R Lab Hospital Manpower - Confidence Interval

  • \(x_4\) and \(x_5\) are not highly correlated pairwise: \(r_{45} = 0.46\).
  • However, their CIs are still inflated because of the collinearity involving the other variables.
   2.5 % 97.5 %
x4   -20     12
x5  -856     67

Eigensystem Analysis: Condition Indices

  • The eigenvalues of \({\bf \Sigma}\) (the correlation matrix of \(\bf X\)), \(\lambda_1, \lambda_2, \dots, \lambda_k\), can measure collinearity.
  • If there are one or more near-linear dependencies, one or more of the \(\lambda_i\)s will be (relatively) small.
  • Condition indices of \({\bf \Sigma}\) are \(\kappa_j = \frac{\lambda_{max}}{\lambda_{j}}\).
  • The number of \(\kappa_j > 1000\) is a measure of the number of near-linear dependencies in \({\bf \Sigma}.\)

R Lab Hospital Manpower - Eigensystem Analysis

eigen_Sig <- eigen(Sig)
## eigenvalues
(lambda <- eigen_Sig$values)
[1] 4.2e+00 6.7e-01 9.5e-02 4.1e-02 5.4e-05
## Conditional indices
max(lambda) / lambda
[1]     1.0     6.3    44.4   103.1 77769.7
  • \(\lambda_5 \approx 0\) and \(\kappa_5 \approx 77770\), indicating collinearity.

Eigenvalues are listed in decreasing order, \(\lambda_1 > \lambda_2 > \cdots > \lambda_k\); \(\lambda_5\) is simply the smallest eigenvalue, not "the eigenvalue of \(x_5\)".

Eigensystem Analysis: Eigendecomposition

  • Eigendecomposition \[{\bf \Sigma = V\boldsymbol \Lambda V'}\]
    • \(\boldsymbol \Lambda\) is a \(k \times k\) diagonal matrix whose elements are \(\lambda_j\).
    • \({\bf V} = [{\bf v}_1 \quad {\bf v}_2 \quad \dots \quad {\bf v}_k]\) is a \(k \times k\) orthogonal matrix whose columns are the eigenvectors of \({\bf \Sigma}\).
  • If \(\lambda_j \approx 0\), the associated eigenvector \({\bf v}_j = (v_{1j}, v_{2j}, \dots, v_{kj})'\) describes which regressors are involved in the near-linear dependency and how: \[\sum_{i=1}^kv_{ij}{\bf x}_i \cong \mathbf{0}\]

R Lab Hospital Manpower - Eigensystem Analysis

## eigenvector matrix
(V <- eigen_Sig$vectors)
      [,1]     [,2]  [,3]  [,4]    [,5]
[1,] -0.49 -0.00203 -0.17 -0.47  0.7195
[2,] -0.45 -0.33561  0.80  0.19  0.0012
[3,] -0.48 -0.00085 -0.15 -0.51 -0.6941
[4,] -0.46 -0.31080 -0.54  0.63 -0.0234
[5,] -0.33  0.88925  0.12  0.29 -0.0068
  • \({\bf X}{\bf v}_5 = \sum_{i=1}^5v_{i5}{\bf x}_i \approx {\bf 0}\).
  • \(0.720 {\bf x}_1 + 0.001 {\bf x}_2 - 0.694 {\bf x}_3 - 0.023 {\bf x}_4 - 0.007 {\bf x}_5 \approx {\bf 0}\)
  • The highly correlated \(x_1\) and \(x_3\) cause the collinearity.
  • All \({\bf x}\)s here are unit length scaled (using the unit_len_scale helper sketched earlier).
X_s <- apply(X, 2, unit_len_scale)
X_s %*% V[, 5]
          [,1]
 [1,] -0.00037
 [2,] -0.00144
 [3,]  0.00032
 [4,] -0.00058
 [5,] -0.00108
 [6,]  0.00047
 [7,] -0.00131
 [8,]  0.00492
 [9,] -0.00225
[10,]  0.00230
[11,] -0.00050
[12,]  0.00209
[13,] -0.00251
[14,] -0.00016
[15,]  0.00131
[16,] -0.00098
[17,] -0.00023

Other Diagnostics

Collinearity may exist if

  • the overall \(F\)-test for regression is significant, but the individual \(t\)-tests are all non-significant.

  • the coefficient estimates are unstable:

    • adding or removing a regressor produces large changes in the estimates
    • deleting one or more observations results in large changes in the estimates
    • the signs or magnitudes of the estimates are contrary to prior expectation

R Lab Hospital Manpower - Significance

summ_full$coefficients  ## most individual t-tests not significant
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1962.948    1.1e+03    1.83    0.094
x1           -15.852    9.8e+01   -0.16    0.874
x2             0.056    2.1e-02    2.63    0.023
x3             1.590    3.1e+00    0.51    0.617
x4            -4.219    7.2e+00   -0.59    0.569
x5          -394.314    2.1e+02   -1.88    0.087
summ_full$fstatistic  ## F-test significant
value numdf dendf 
  238     5    11 
lm_no_x3 <- lm(y ~ . -x3, data = manpower)
summary(lm_no_x3)$coef
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 2161.962    967.871     2.2  4.5e-02
x1            34.284      4.897     7.0  1.4e-05
x2             0.057      0.021     2.8  1.7e-02
x4            -6.600      5.311    -1.2  2.4e-01
x5          -440.297    183.696    -2.4  3.4e-02

Methods for Dealing with Collinearity

Data collection: Collect more data to break up the collinearity in the existing data


Model specification/An overdefined model: Respecify the model

  • redefining the regressors: use \(x = x_1+x_2\) or \(x = x_1x_2\).
    • Avoid combining regressors in different units.
  • eliminating regressors: remove \(x_1\) or \(x_2\).
    • May damage the predictive power if the removed regressors have significant explanatory power. (Variable selection)
    • If we remove \(x_2\), we estimate the marginal relationship between \(y\) and \(x_1\), ignoring \(x_2\), rather than the partial relationship conditioning on \(x_2\).

Constraint on the model or in the population: Say goodbye to least-squares estimation.

  • Ridge Regression, Principal Component Regression, Bayesian Regression, etc.
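As a minimal sketch (not from the lecture), ridge regression via MASS::lm.ridge applied to the manpower data; the ridge penalty shrinks and stabilizes the coefficients under collinearity (glmnet is another common choice):

library(MASS)
ridge_fit <- lm.ridge(y ~ ., data = manpower, lambda = seq(0, 0.1, by = 0.001))
select(ridge_fit)                            ## suggests a lambda by the HKB, L-W, and GCV criteria
coef(ridge_fit)[which.min(ridge_fit$GCV), ]  ## coefficients at the GCV-chosen lambda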
