Collinearity

MATH 4780 / MSSC 5780 Regression Analysis

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

Collinearity

Meaning

Sources

Effects

Diagnostics

Solutions

What is Collinearity

  • Collinearity refers to the situation in which two or more predictors are closely related to one another.
  • limit and age appear to have no obvious relationship, which is good!
  • limit and rating are highly correlated, and they are said to be collinear.
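A quick check of these claims in R (a minimal sketch; it assumes the plots refer to the Credit data in the ISLR2 package, where the column names are capitalized):

library(ISLR2)
## pairwise correlations among the three predictors discussed above
cor(Credit[, c("Limit", "Rating", "Age")])
## Limit and Rating are nearly perfectly correlated; Limit and Age are not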

Sources of Collinearity

Four primary sources

  • The data collection method employed
  • Constraints on the model or in the population
  • Model specification
  • A model with \(p>n\)

Data Collection

  • Collinearity occurs when only a subspace of the entire sample space has been explored.
  • We may be able to reduce this collinearity by changing the sampling plan, e.g., collecting data in the unexplored region.
  • Unlike a constraint, there is no physical reason why we cannot sample in that region.

Constraints

  • When physical constraints are present in the model or the population, the collinearity will exist regardless of the data collection method.

Model Specification

  • Polynomial terms can cause ill-conditioning in \({\bf X'X}\).
  • As the order of the model increases, inverting \({\bf X'X}\) becomes numerically inaccurate, and error can be introduced into the parameter estimates.
  • If the range of a regressor variable is small, adding an \(x^2\) term can result in significant collinearity, as illustrated in the sketch below.
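A minimal sketch (simulated data, not from the lecture) of the last point: when the range of \(x\) is narrow, \(x\) and \(x^2\) are nearly collinear, and centering \(x\) before squaring alleviates the problem.

set.seed(4780)
x <- runif(50, min = 10, max = 11)  ## regressor with a narrow range
cor(x, x^2)                         ## very close to 1: x and x^2 nearly collinear
xc <- x - mean(x)                   ## center x first
cor(xc, xc^2)                       ## much smaller in magnitude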

\(p>n\) Model

  • More regressor variables than observations.
  • The best way to counter this is to remove/reconstruct regressor variables.
    • Principal Component Regression
    • Variable Selection (Next Topic)

Effect of Collinearity

  1. 👉 Large variances and covariances for the LSEs \(b_j\).
  2. 👉 Tends to produce LSEs \(b_j\) that are too large in absolute value, so the vector \({\bf b}\) is, on average, much longer than the vector \(\boldsymbol \beta\) (see the simulation sketch below).
  • Large variances and large coefficient magnitudes lead to unstable and wrongly signed coefficients.
  • Poor coefficient estimates do not necessarily imply a bad fit or poor prediction.
  • Predictions should be confined to the region of the \(x\) space where the collinearity holds approximately.
  • Collinearity causes very poor extrapolated predictions.
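A small simulation (a sketch, not part of the original slides) illustrating points 1 and 2: with two nearly identical regressors, the LSE vector \({\bf b}\) varies wildly from sample to sample and is, on average, far longer than \(\boldsymbol \beta\).

set.seed(4780)
b_len <- replicate(2000, {
  x1 <- rnorm(30)
  x2 <- x1 + rnorm(30, sd = 0.05)    ## x2 is nearly a copy of x1
  y  <- 2 * x1 + 2 * x2 + rnorm(30)  ## true beta = (2, 2), no intercept
  b  <- coef(lm(y ~ x1 + x2 - 1))
  sqrt(sum(b^2))                     ## length of the LSE vector
})
mean(b_len)      ## much larger than ...
sqrt(2^2 + 2^2)  ## ... the true length ||beta||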

Perfectly Correlated Regressors

  • Suppose the true population regression equation is \(y = 3 + 4x\).

  • Suppose we try estimating that equation using the perfectly correlated regressors \(x\) and \(z = x/10\).

\[ \begin{aligned}\hat{y}&= \hat{\beta}_0 + \hat{\beta}_1x + \hat{\beta}_2z\\ &= \hat{\beta}_0 + \hat{\beta}_1x + \hat{\beta}_2\frac{x}{10}\\ &= \hat{\beta}_0 + \bigg(\hat{\beta}_1 + \frac{\hat{\beta}_2}{10}\bigg)x \end{aligned} \]

  • Can set \(\hat{\beta}_1\) and \(\hat{\beta}_2\) to any two numbers such that \(\hat{\beta}_1 + \frac{\hat{\beta}_2}{10} = 4\).

  • Unable to choose the “best” combination of \(\hat{\beta}_1\) and \(\hat{\beta}_2\).
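In R, lm() detects an exact linear dependence and reports the coefficient of the redundant regressor as NA (aliased); a minimal sketch:

set.seed(4780)
x <- rnorm(30)
z <- x / 10                 ## perfectly correlated with x
y <- 3 + 4 * x + rnorm(30)
coef(lm(y ~ x + z))         ## the coefficient of z is NA: it is not estimable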

Collinearity Diagnostics

Ideal characteristics of a collinearity diagnostic:

  • Correctly indicate whether collinearity is present
  • Indicate how severe the problem is
  • Provide insight into which regressors are causing the problem

Examination of the Correlation Matrix of \(x\)s

  • After unit length scaling, \({\bf X'X} = \left[r_{ij}\right]_{k\times k}\) is the correlation matrix of the \(x\)s, denoted \({\bf \Sigma}\). For example, \[{\bf X'X} = \begin{bmatrix} 1 & 0.992 \\ 0.992 & 1 \end{bmatrix}\]

  • \(r_{ij}\) is the pairwise correlation between \(x_i\) and \(x_j\).

  • Large \(|r_{ij}|\) is an indication of collinearity.

  • When more than two regressors are involved in collinearity, there may be instances when collinearity is present, but the pairwise correlations are not large.

  • Inspecting \(r_{ij}\) is not sufficient for detecting more complex collinearity.
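The unit_len_scale helper used later in the R lab is not shown in the slides; a minimal sketch of it, together with a check on simulated data that \({\bf X'X}\) after scaling is exactly the correlation matrix:

## center each column and divide by the square root of its corrected sum of squares
unit_len_scale <- function(x) (x - mean(x)) / sqrt(sum((x - mean(x))^2))
set.seed(4780)
X_sim <- cbind(x1 = rnorm(20), x2 = rnorm(20), x3 = rnorm(20))
X_s <- apply(X_sim, 2, unit_len_scale)
all.equal(t(X_s) %*% X_s, cor(X_sim))  ## TRUE: X'X equals the correlation matrix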

Variance Inflation Factors

  • The diagonal elements of \({\bf C} = {\bf \Sigma}^{-1}\), the inverse of the correlation matrix of the \(x\)s, are called variance inflation factors: \[\text{VIF}_j = {\bf C}_{jj}\]

  • Example: \[{\bf \Sigma} = \begin{bmatrix} 1 & 0.992 \\ 0.992 & 1 \end{bmatrix}; \quad {\bf \Sigma}^{-1} = \begin{bmatrix} 62.8 & -62.2 \\ -62.2 & 62.8 \end{bmatrix}\] and \(\text{VIF}_1 = 62.8\).

  • The collinearity inflates the variances of the estimated coefficients by more than 60-fold compared with the ideal case in which the two regressors are orthogonal (VIF = 1).
  • VIFs \(> 10\) are generally considered a sign of serious collinearity.
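A quick numerical check of the \(2 \times 2\) example above (a sketch):

Sig2 <- matrix(c(1, 0.992, 0.992, 1), nrow = 2)
solve(Sig2)         ## diagonal elements are the VIFs, about 62.8
1 / (1 - 0.992^2)   ## same value: with two regressors, VIF = 1 / (1 - r^2)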

Variance Inflation Factors

\[\text{VIF}_j = \frac{1}{1 - R^2_{X_j | X_{-j}}}\] where \(R^2_{X_j | X_{-j}}\) is the coefficient of determination obtained when \(x_j\) is regressed on the other regressors \(x_i, i \ne j\).

\[\mathrm{Var}(b_j) = \frac{s^2}{\sum_{i=1}^n(x_{ij} - \bar{x}_j)^2} \times \text{VIF}_j\]

  • \(\text{VIF}_j\) measures the combined effect of the dependencies among the regressors on the variance of \(b_j\).

  • If \(x_j\) is nearly linearly dependent on some subset of the remaining regressors, \(\text{VIF}_j = {\bf C}_{jj}\) is large.

Remember what linear (in)dependence means?
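A minimal sketch (simulated data) verifying that the two definitions of \(\text{VIF}_j\) agree: regressing \(x_1\) on the other regressors and computing \(1/(1 - R^2)\) matches the first diagonal element of the inverse correlation matrix.

set.seed(4780)
x1 <- rnorm(40); x2 <- x1 + rnorm(40, sd = 0.3); x3 <- rnorm(40)
Xd <- data.frame(x1, x2, x3)
R2_1 <- summary(lm(x1 ~ x2 + x3, data = Xd))$r.squared
1 / (1 - R2_1)            ## VIF_1 from the R^2 definition
diag(solve(cor(Xd)))[1]   ## VIF_1 from the inverse correlation matrix; same value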

R Lab Hospital Manpower Data

manpower
       y  x1    x2    x3    x4   x5
1    567  16  2463   473  18.0  4.5
2    697  44  2048  1340   9.5  6.9
3   1033  20  3940   620  12.8  4.3
4   1604  19  6505   568  36.7  3.9
5   1611  49  5723  1498  35.7  5.5
6   1613  45 11520  1366  24.0  4.6
7   1854  55  5779  1687  43.3  5.6
8   2161  59  5969  1640  46.7  5.2
9   2306  94  8461  2872  78.7  6.2
10  3504 128 20106  3655 180.5  6.2
11  3572  96 13313  2912  60.9  5.9
12  3741 131 10771  3921 103.7  4.9
13  4027 127 15543  3866 126.8  5.5
14 10344 253 36194  7684 157.7  7.0
15 11732 409 34703 12446 169.4 10.8
16 15415 464 39204 14098 331.4  7.0
17 18854 510 86533 15524 371.6  6.3

  • \(y\): Monthly man-hours
  • \(x_1\): Average daily patient load
  • \(x_2\): Monthly X-ray exposures
  • \(x_3\): Monthly occupied bed days
  • \(x_4\): Eligible population in the area / 1000
  • \(x_5\): Average length of patients’ stay in days

Do you expect to see a positive or negative relationship between \(y\) and each \(x_i\)?

R Lab Hospital Manpower - Pairwise Dependence
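The scatterplot matrix shown in class is not reproduced here; a sketch of how to generate it and the pairwise correlations:

pairs(manpower, pch = 19, col = 4)  ## scatterplot matrix of y and x1-x5
round(cor(manpower), 2)             ## pairwise correlations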

R Lab Hospital Manpower - Model Fit

lm_full <- lm(y ~ ., data = manpower)
(summ_full <- summary(lm_full))
...
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) 1962.9482  1071.3617    1.83    0.094 .
x1           -15.8517    97.6530   -0.16    0.874  
x2             0.0559     0.0213    2.63    0.023 *
x3             1.5896     3.0921    0.51    0.617  
x4            -4.2187     7.1766   -0.59    0.569  
x5          -394.3141   209.6395   -1.88    0.087 .
Residual standard error: 642 on 11 degrees of freedom
Multiple R-squared:  0.991, Adjusted R-squared:  0.987 
...

Excellent fit. But are there any issues with this fitted result?

  • The coefficients \(b_1\), \(b_4\) and \(b_5\) are negative.
  • In the case of \(x_1\), an increase in patient load, with the other \(x\)s held constant, corresponds to a decrease in hospital manpower (a wrong sign caused by the large variance).

R Lab Hospital Manpower - VIF

X <- manpower[, -1]
(Sig <- cor(X))
     x1   x2   x3   x4   x5
x1 1.00 0.91 1.00 0.94 0.67
x2 0.91 1.00 0.91 0.91 0.45
x3 1.00 0.91 1.00 0.93 0.67
x4 0.94 0.91 0.93 1.00 0.46
x5 0.67 0.45 0.67 0.46 1.00
(C <- solve(Sig))
      x1    x2    x3     x4    x5
x1  9598  11.9 -9247 -318.8 -93.9
x2    12   7.9   -18   -1.9   1.8
x3 -9247 -18.5  8933  294.4  83.4
x4  -319  -1.9   294   23.3   6.4
x5   -94   1.8    83    6.4   4.3
## VIF 
diag(C)
    x1     x2     x3     x4     x5 
9597.6    7.9 8933.1   23.3    4.3 
## put the fitted model in vif()
(vif_all <- car::vif(lm_full))
    x1     x2     x3     x4     x5 
9597.6    7.9 8933.1   23.3    4.3 

\(x_1\) Average daily patient load and \(x_3\) Monthly occupied bed days are highly correlated.

R Lab Hospital Manpower - Confidence Interval

  • Marginally, \(b_1\) and \(b_3\) vary a lot. The CI for \(\beta_1\) and CI for \(\beta_3\) both contain zero.
    2.5 % 97.5 %
x1 -230.8  199.1
x3   -5.2    8.4

  • But we are quite confident that \(\beta_1\) and \(\beta_3\) cannot both be zero; see the joint test sketched below.
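The intervals above come from confint(); the joint claim can be checked with an \(F\)-test of \(\beta_1 = \beta_3 = 0\) (a sketch using car::linearHypothesis, which, per the claim above, should reject; a joint confidence ellipse tells the same story):

confint(lm_full)[c("x1", "x3"), ]                        ## marginal CIs shown above
car::linearHypothesis(lm_full, c("x1 = 0", "x3 = 0"))    ## joint F-test of beta1 = beta3 = 0
## car::confidenceEllipse(lm_full, which.coef = c(2, 4)) ## joint confidence region for (beta1, beta3)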

R Lab Hospital Manpower - Confidence Interval

  • \(x_4\) and \(x_5\) are not highly correlated pairwise: \(r_{45} = 0.46\).
  • However, their CIs are still inflated because of the collinearity involving the other variables.
   2.5 % 97.5 %
x4   -20     12
x5  -856     67

Eigensystem Analysis: Condition Indices

  • The eigenvalues of \({\bf \Sigma}\) (the correlation matrix of \(\bf X\)), \(\lambda_1, \lambda_2, \dots, \lambda_k\), can measure collinearity.
  • If there are one or more near-linear dependencies, one or more of the \(\lambda_i\)s will be (relatively) small.
  • Condition indices of \({\bf \Sigma}\) are \(\kappa_j = \frac{\lambda_{max}}{\lambda_{j}}\).
  • The number of \(\kappa_j > 1000\) is a measure of the number of near-linear dependencies in \({\bf \Sigma}.\)

R Lab Hospital Manpower - Eigensystem Analysis

eigen_Sig <- eigen(Sig)
## eigenvalues
(lambda <- eigen_Sig$values)
[1] 4.2e+00 6.7e-01 9.5e-02 4.1e-02 5.4e-05
## Conditional indices
max(lambda) / lambda
[1]     1.0     6.3    44.4   103.1 77769.7
  • \(\lambda_5 \approx 0\) and \(\kappa_5 \approx 77770\), indicating collinearity.

Eigenvalues are listed in decreasing order, \(\lambda_1 > \lambda_2 > \cdots > \lambda_k\); \(\lambda_5\) is simply the smallest eigenvalue, not "the eigenvalue of \(x_5\)".

Eigensystem Analysis: Eigendecomposition

  • Eigendecomposition \[{\bf \Sigma = V\boldsymbol \Lambda V'}\]
    • \(\boldsymbol \Lambda\) is a \(k \times k\) diagonal matrix whose elements are \(\lambda_j\).
    • \({\bf V} = [{\bf v}_1 \quad {\bf v}_2 \quad \dots \quad {\bf v}_k]\) is a \(k \times k\) orthogonal matrix whose columns are the eigenvectors of \({\bf \Sigma}\).
  • If \(\lambda_j \approx 0\), the associated eigenvector \({\bf v}_j = (v_{1j}, v_{2j}, \dots, v_{kj})'\) describes which regressors are involved in the near-linear dependency and how: \[\sum_{i=1}^kv_{ij}{\bf x}_i \cong \mathbf{0}\]

R Lab Hospital Manpower - Eigensystem Analysis

## eigenvector matrix
(V <- eigen_Sig$vectors)
      [,1]     [,2]  [,3]  [,4]    [,5]
[1,] -0.49 -0.00203 -0.17 -0.47  0.7195
[2,] -0.45 -0.33561  0.80  0.19  0.0012
[3,] -0.48 -0.00085 -0.15 -0.51 -0.6941
[4,] -0.46 -0.31080 -0.54  0.63 -0.0234
[5,] -0.33  0.88925  0.12  0.29 -0.0068
  • \({\bf X}{\bf v}_5 = \sum_{i=1}^5v_{i5}{\bf x}_i \approx {\bf 0}\).
  • \(0.720 {\bf x}_1 + 0.001 {\bf x}_2 - 0.694 {\bf x}_3 - 0.023 {\bf x}_4 - 0.007 {\bf x}_5 \approx {\bf 0}\)
  • The highly correlated \(x_1\) and \(x_3\) cause the collinearity.
  • All \({\bf x}\)s here are unit length scaled (using the unit_len_scale helper sketched earlier).
X_s <- apply(X, 2, unit_len_scale)
X_s %*% V[, 5]
          [,1]
 [1,] -0.00037
 [2,] -0.00144
 [3,]  0.00032
 [4,] -0.00058
 [5,] -0.00108
 [6,]  0.00047
 [7,] -0.00131
 [8,]  0.00492
 [9,] -0.00225
[10,]  0.00230
[11,] -0.00050
[12,]  0.00209
[13,] -0.00251
[14,] -0.00016
[15,]  0.00131
[16,] -0.00098
[17,] -0.00023

Other Diagnostics

Collinearity may exist if

  • the overall \(F\)-test for regression is significant, but the individual \(t\)-tests are all non-significant.

  • the coefficient estimates are unstable:

    • adding or removing a regressor produces large changes in the estimates
    • deleting one or more observations results in large changes in the estimates
    • the signs or magnitudes of the estimates are contrary to prior expectation

R Lab Hospital Manpower - Significance

summ_full$coefficients  ## most individual t-tests not significant
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1962.948    1.1e+03    1.83    0.094
x1           -15.852    9.8e+01   -0.16    0.874
x2             0.056    2.1e-02    2.63    0.023
x3             1.590    3.1e+00    0.51    0.617
x4            -4.219    7.2e+00   -0.59    0.569
x5          -394.314    2.1e+02   -1.88    0.087
summ_full$fstatistic  ## F-test significant
value numdf dendf 
  238     5    11 
lm_no_x3 <- lm(y ~ . -x3, data = manpower)
summary(lm_no_x3)$coef
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 2161.962    967.871     2.2  4.5e-02
x1            34.284      4.897     7.0  1.4e-05
x2             0.057      0.021     2.8  1.7e-02
x4            -6.600      5.311    -1.2  2.4e-01
x5          -440.297    183.696    -2.4  3.4e-02

Methods for Dealing with Collinearity

Data collection: Collect more data to break up the collinearity in the existing data


Model specification/An overdefined model: Respecify the model

  • redefining the regressors: use \(x = x_1+x_2\) or \(x = x_1x_2\).
    • Avoid combining regressors in different units.
  • eliminating regressors: remove \(x_1\) or \(x_2\).
    • May damage the predictive power if the removed regressors have significant explanatory power. (Variable selection)
    • If we remove \(x_2\), we estimate the marginal relationship between \(y\) and \(x_1\), ignoring \(x_2\), rather than the partial relationship conditioning on \(x_2\).

Constraint on the model or in the population: Say goodbye to least-squares estimation.

  • Ridge Regression, Principal Component Regression, Bayesian Regression, etc.
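As a minimal sketch (not from the lecture), ridge regression via MASS::lm.ridge applied to the manpower data; the ridge penalty shrinks and stabilizes the coefficients under collinearity (glmnet is another common choice):

library(MASS)
ridge_fit <- lm.ridge(y ~ ., data = manpower, lambda = seq(0, 0.1, by = 0.001))
select(ridge_fit)                            ## suggests a lambda by the HKB, L-W, and GCV criteria
coef(ridge_fit)[which.min(ridge_fit$GCV), ]  ## coefficients at the GCV-chosen lambda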
