Homework 6 - Collinearity and Variable Selection

Due Friday, December 8, 11:59 PM on D2L

Homework Instructions and Requirements

  • Homework 6 covers course material from Weeks 1 to 14.

  • Please submit your work as a single PDF file, including all parts, to D2L > Assessments > Dropbox. Multiple files, or a file that is not in PDF format, will not be accepted.

  • In your homework, please number and answer questions in order.

  • Your entire work on Statistical Computing and Data Analysis should be completed using word processing software (Microsoft Word, Google Docs, (R)Markdown, Quarto, LaTeX, etc.) and your preferred programming language. Your document should be a PDF file.

  • It is your responsibility to make clear what you are trying to show. If you type your answers, make sure there are no typos. I grade your work based on what you show, not what you meant to show. If you choose to handwrite your answers, write them neatly. If I can’t read your handwriting, your answer will be marked wrong.

Reading and Writing (30 points for MATH 4780, and 50 points for MSSC 5780)

In this class, we have been using the classical, or frequentist, approach for a variety of statistical inferences, for example marginal \(t\) tests for \(\beta_j\) and \(F\) tests for model comparison. But Dr. Yu once said he never uses p-values in his own research, and there are many issues with what is taught in Intro Stats (MATH 1700/4720/4740). In fact, the null hypothesis significance testing (NHST) paradigm and the use of p-values have been heavily criticized and shown to be problematic and frequently misused, contributing to the reproducibility and replication crisis in scientific research. Please write a summary paper (at least one page for MATH 4780 and two pages for MSSC 5780) that includes

  • An interpretation of the p-value
  • A list and discussion of the problems with NHST and the p-value method
  • Possible solutions to those problems

Some references are

There are many discussions and papers on this topic. You are welcome to search for more resources to support your argument. The work should be entirely your own. You are not allowed to copy anyone’s words, and you must cite every resource you use (papers, blogs, videos, lecture notes, etc.); otherwise you violate the Marquette academic misconduct policy.

Statistical Computing and Data Analysis

Please perform a data analysis using \(\texttt{R}\) or your preferred language. All results should be generated as computer output, and your work should be done entirely on a computer. Handwriting is not allowed. Relevant code should be attached.

We use the same data set mpg.csv for data analysis. For the following analysis, if any regressor contains a missing value NA, remove the corresponding row of the data matrix.
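The row-removal rule above can be sketched in base R as follows. This is only an illustration under stated assumptions: the tiny CSV written here is a made-up stand-in for mpg.csv, and the column names (`y`, `x1`, `x10`, `x11`) are assumed, so the real file's layout may differ.

```r
# Hypothetical stand-in for mpg.csv with made-up values and assumed
# column names; replace `tmp` with the path to the real file.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "y,x1,x10,x11",
  "20,300,3500,1",
  "22,NA,3300,1",
  "30,100,2200,0",
  "28,120,NA,0",
  "18,350,4000,1"
), tmp)

mpg <- read.csv(tmp)  # the string "NA" is read as a missing value by default

# Every column except the response is treated as a regressor (assumption)
regressors <- setdiff(names(mpg), "y")

# Keep only the rows with no missing value in any regressor
mpg_cc <- mpg[complete.cases(mpg[, regressors]), ]

nrow(mpg)     # rows before filtering
nrow(mpg_cc)  # rows after filtering
```

If only some regressors enter a given model, restricting `regressors` to those columns keeps more rows; which reading of the rule is intended is a judgment call.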

  1. Build a linear regression model relating gasoline mileage \(y\) to vehicle weight \(x_{10}\) and the type of transmission \(x_{11}\) (1 automatic; 0 manual). Does the type of transmission significantly affect the mileage performance?

  2. Modify the model developed in (1) to include an interaction between vehicle weight and the type of transmission. What conclusions can you draw about the effect of the type of transmission on gasoline mileage? Interpret the parameters in this model.
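As a hedged illustration of the two model forms in (1) and (2), the additive model is \(y = \beta_0 + \beta_1 x_{10} + \beta_2 x_{11} + \epsilon\) and the interaction model adds the term \(\beta_3 x_{10} x_{11}\). The sketch below fits both in R on synthetic placeholder data, not the course data; the column names and the simulated coefficients are assumptions chosen only to make the code run.

```r
# Synthetic placeholder data (NOT the course data); column names are assumed.
set.seed(1)
n   <- 40
x10 <- runif(n, 1900, 5500)   # weight (lb)
x11 <- rbinom(n, 1, 0.6)      # 1 = automatic, 0 = manual
y   <- 40 - 0.005 * x10 - 2 * x11 + rnorm(n, sd = 1.5)
dat <- data.frame(y, x10, x11)

# (1) Additive model: parallel lines, x11 shifts the intercept only
fit_add <- lm(y ~ x10 + x11, data = dat)
summary(fit_add)$coefficients   # includes the marginal t test for x11

# (2) Interaction model: x11 may change both intercept and slope
fit_int <- lm(y ~ x10 * x11, data = dat)
summary(fit_int)$coefficients

# Partial F test: do the two transmission terms matter jointly?
fit_wt <- lm(y ~ x10, data = dat)
anova(fit_wt, fit_int)

# Regression lines implied by the interaction model, by transmission type
b <- coef(fit_int)
manual    <- c(intercept = b[["(Intercept)"]], slope = b[["x10"]])
automatic <- c(intercept = b[["(Intercept)"]] + b[["x11"]],
               slope     = b[["x10"]] + b[["x10:x11"]])
rbind(manual, automatic)
```

In the interaction model, \(\beta_2\) is the difference in intercepts and \(\beta_3\) the difference in weight slopes between automatic and manual cars, which is why the last table is built by adding those coefficients to the manual line.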

For the following questions (3) to (7), consider all the regressors except \(x_4\), \(x_5\) and \(x_{11}\).

  • \(y\): MPG
  • \(x_1\): Displacement (cubic in.)
  • \(x_2\): Horsepower (hp)
  • \(x_3\): Torque (ft-lb)
  • \(x_6\): Carburetor (barrels)
  • \(x_7\): No. of transmission speeds
  • \(x_8\): Overall length (in.)
  • \(x_9\): Width (in.)
  • \(x_{10}\): Weight (lb)

Standardize the response and predictors before later analysis.
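One possible way to standardize is unit normal scaling with `scale()`, sketched below on synthetic placeholder data with assumed column names. Whether the course intends unit normal or unit length scaling is not stated here; the two differ only by a factor of \(\sqrt{n-1}\) per column.

```r
# Synthetic placeholder data with assumed column names (not the course data).
set.seed(2)
dat <- data.frame(
  y   = rnorm(30, 20, 6),
  x1  = rnorm(30, 280, 100),
  x10 = rnorm(30, 3500, 800)
)

# scale() centers each column and divides by its sample standard deviation
dat_std <- as.data.frame(scale(dat))

round(colMeans(dat_std), 10)   # column means are 0 up to rounding error
sapply(dat_std, sd)            # column standard deviations are 1
```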

  3. Obtain the correlation matrix of regressors. Does it give any indication of collinearity?

  4. Calculate the variance inflation factors (VIFs). Is there any evidence of collinearity?

  5. Find the eigenvectors associated with the smallest eigenvalues of the correlation matrix of regressors. Interpret the elements of these vectors. What can you say about the source of collinearity in these data?

  6. Use the all-possible-regressions approach to find an appropriate regression model.

  7. Use stepwise regression to specify a subset regression model. Does this lead to the same model found in (6)?
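The mechanics behind questions (3) to (7) can be sketched in base R as below. This is an illustration on synthetic data in which `x2` is deliberately built to be nearly collinear with `x1`; the variable names, the choice of BIC and adjusted \(R^2\) as ranking criteria, and the AIC-based `step()` search are all assumptions, not the required method. In particular, `step()` selects by AIC rather than by partial \(F\) tests with entry and removal cutoffs, so it may not match a textbook stepwise procedure.

```r
# Synthetic placeholder data (NOT the course data): x2 is built to be nearly
# a linear function of x1 so that collinearity shows up. Names are assumed.
set.seed(3)
n  <- 60
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly collinear with x1
x3 <- rnorm(n)
x4 <- rnorm(n)
y  <- 1 + 2 * x1 - x3 + rnorm(n)
dat <- as.data.frame(scale(data.frame(y, x1, x2, x3, x4)))
X   <- dat[, -1]

# (3) Correlation matrix of the regressors
R <- cor(X)
round(R, 2)

# (4) VIFs are the diagonal elements of the inverse correlation matrix
vif <- diag(solve(R))
round(vif, 1)

# (5) Eigen-decomposition; eigen() returns eigenvalues in decreasing order,
#     so the last column of $vectors goes with the smallest eigenvalue
e <- eigen(R)
round(e$values, 3)
v_small <- setNames(e$vectors[, ncol(R)], colnames(R))
round(v_small, 2)   # large entries flag the regressors in the near-dependence

# (6) All possible regressions: fit every non-empty subset of regressors
vars    <- colnames(X)
subsets <- unlist(lapply(seq_along(vars),
                         function(k) combn(vars, k, simplify = FALSE)),
                  recursive = FALSE)
fits <- lapply(subsets,
               function(s) lm(reformulate(s, response = "y"), data = dat))
tab  <- data.frame(
  model  = sapply(subsets, paste, collapse = " + "),
  p      = lengths(subsets),
  adj_r2 = sapply(fits, function(f) summary(f)$adj.r.squared),
  bic    = sapply(fits, BIC)
)
head(tab[order(tab$bic), ], 5)   # five best subsets by BIC

# (7) Stepwise search by AIC, starting from the intercept-only model
null_fit <- lm(y ~ 1, data = dat)
step_fit <- step(null_fit,
                 scope = list(lower = ~ 1, upper = reformulate(vars)),
                 direction = "both", trace = 0)
formula(step_fit)
```

With many regressors the number of subsets grows as \(2^k - 1\), so the exhaustive loop in (6) is only practical for a small \(k\) such as the eight regressors listed above.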