Homework 4 - Diagnostics

Due Friday, November 3, 11:59 PM on D2L

Homework Instructions and Requirements

  • Homework 4 covers course material from Weeks 1 to 10.

  • Please submit your work as a single PDF file, including all parts, to D2L > Assessments > Dropbox. Multiple files, or a file that is not in PDF format, will not be accepted.

  • In your homework, please number the questions and answer them in order.

  • Your entire work on Statistical Computing and Data Analysis should be completed with word-processing software (Microsoft Word, Google Docs, (R)Markdown, Quarto, LaTeX, etc.) and your preferred programming language. Your submitted document should be a PDF file.

  • Questions starting with (MSSC) are for MSSC 5780 students.

  • It is your responsibility to make clear what you are trying to show. If you type your answers, make sure there are no typos. I grade your work based on what you show, not what you want to show. If you choose to handwrite your answers, write them neatly. If I cannot read your handwriting, your answer will be marked wrong.

Statistical Computing and Data Analysis

Please perform a data analysis using \(\texttt{R}\) or your preferred language. All results should be generated as computer output, and your work should be done entirely on the computer; handwriting is not allowed. Attach the relevant code.

Diagnostics on Gasoline Mileage Data

We use the same data set, mpg.csv, for this analysis. Consider the multiple regression model \(y = \beta_0 + \beta_1x_1 + \beta_6x_6+\epsilon\) fit to the gasoline mileage data from your Homework 3.
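Hint: one possible way to set up this fit in R is sketched below. The column names y, x1, and x6 are assumptions made for illustration; match them to the actual header of mpg.csv.

```r
# Minimal sketch: load the data and fit the Homework 3 model.
# Assumes mpg.csv has columns named y, x1, and x6.
mpg <- read.csv("mpg.csv")
fit <- lm(y ~ x1 + x6, data = mpg)
summary(fit)
```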

  1. Compare the R-student residuals \(t_i\) with the Student's \(t_{n-p-1}\) distribution using a Q-Q plot. Generate a histogram and a density plot of \(t_i\) as well. Does there appear to be any problem with the normality assumption? (Starter R sketches for Questions 1–5 appear after this list.)

  2. Perform the Box-Cox method and discuss whether a transformation of \(y\) is necessary.

  3. Use \(\lambda = 0\) (log transformation) and the \(\lambda\) selected by the Box-Cox method to refit the model to the transformed data. Compare their R-student residuals with the R-student residuals from the untransformed data using side-by-side boxplots.

  4. Construct a plot of the R-student residuals \(t_i\) versus the fitted values, and Tukey's spread-level plot. Is there any sign that the constant-variance assumption is violated?

  5. In fact, we can perform a formal hypothesis test of constant variance: \(H_0:\) constant variance vs. \(H_1:\) the variance changes with \(E(y\mid x)\). Use car::ncvTest() to perform the test and explain the result.
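A possible starting point for Question 1, continuing from the fit and mpg objects in the earlier sketch (both names are assumptions of that sketch):

```r
# R-student (externally studentized) residuals and their reference distribution
n   <- nrow(mpg)
p   <- length(coef(fit))                 # number of estimated coefficients
t_i <- rstudent(fit)

# Q-Q plot of t_i against t_{n - p - 1} quantiles
qqplot(qt(ppoints(n), df = n - p - 1), t_i,
       xlab = "Theoretical t quantiles", ylab = "R-student residuals")
qqline(t_i, distribution = function(prob) qt(prob, df = n - p - 1))

# Histogram with an overlaid kernel density estimate
hist(t_i, freq = FALSE, main = "R-student residuals", xlab = "t_i")
lines(density(t_i))
```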
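For Questions 2–3, one option is MASS::boxcox(). The \(\lambda\) grid below and the use of the plain power \(y^\lambda\) (a linear rescaling of the Box-Cox transform, which leaves the R-student residuals unchanged) are choices made in this sketch, not requirements:

```r
library(MASS)

# Box-Cox profile log-likelihood and the maximizing lambda
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05))
lambda_hat <- bc$x[which.max(bc$y)]

# Refit with the log transformation (lambda = 0) and with the selected lambda
fit_log <- lm(log(y) ~ x1 + x6, data = mpg)
fit_bc  <- lm(I(y^lambda_hat) ~ x1 + x6, data = mpg)

# Side-by-side boxplots of the R-student residuals from the three fits
boxplot(list(original = rstudent(fit),
             log      = rstudent(fit_log),
             boxcox   = rstudent(fit_bc)),
        ylab = "R-student residuals")
```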
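For Questions 4–5, the car package provides both Tukey's spread-level plot and the score test of non-constant variance; this sketch again assumes the fit object from above:

```r
library(car)

# R-student residuals versus fitted values
plot(fitted(fit), rstudent(fit),
     xlab = "Fitted values", ylab = "R-student residuals")
abline(h = 0, lty = 2)

# Tukey's spread-level plot (also suggests a variance-stabilizing power of y)
spreadLevelPlot(fit)

# Score test of H0: constant variance vs. H1: variance changes with E(y | x)
ncvTest(fit)
```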

Influence Diagnostics on Squid Data

An experiment was conducted to study the size of squid eaten by sharks and tuna. The regressors are characteristics of the beak or mouth of the squid. The squid.csv data set contains the following variables:

  • \(x_1\): Rostral length in inches
  • \(x_2\): Wing length in inches
  • \(x_3\): Rostral to notch length
  • \(x_4\): Notch to wing length
  • \(x_5\): Width in inches
  • \(y\): Weight in pounds

Perform thorough leverage and influence diagnostics on the squid data.
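Hint: a possible setup in R is sketched below; the column names y and x1 through x5 are assumptions that should be matched to the actual header of squid.csv.

```r
# Minimal sketch: load the squid data and fit the full model.
# Assumes squid.csv has columns named y and x1-x5, as listed above.
squid     <- read.csv("squid.csv")
fit_squid <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = squid)
summary(fit_squid)
```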

  1. Compute the R-student residuals, hat values, Cook’s distance, DFFITS, DFBETAS, and COVRATIO measures. Describe how you detect leverage points and influential points. Discuss the effect of individual data points on the coefficients, the fitted values, and the precision of the coefficients. (The influence.measures() function provides all of these measures; starter sketches for Questions 1–3 follow this list.)

  2. Let’s use some visualization tools.

    • Create the bubble plot.
    • Create the influence index plot (car::influenceIndexPlot()).

  3. Produce the added-variable plot (car::avPlots()) for each regressor \(x_i, i = 1, \dots, 5\). Is there any joint influence of data points on the regression coefficients?

  4. (MSSC) Numerically verify the following properties of the added-variable plot. (A verification sketch follows this list as well.)

    1. The slope of the least-squares simple regression line of \(e(y \mid x_{(1)})\) on \(e(x_1 \mid x_{(1)})\) is the same as the least-squares slope \(b_1\) for \(x_1\) in the full multiple regression.
    2. The residuals from the simple regression \(e(y \mid x_{(1)})\) vs. \(e(x_1 \mid x_{(1)})\) are the same as the residuals \(e_i\) from the full multiple regression.
    3. The standard error of \(b_1\) is \(s / \sqrt{\sum_{i=1}^ne_i^2(x_1 \mid x_{(1)})}\), where \(s = \sqrt{MS_{res}}\) from the full multiple regression.
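For Question 1, one starting point, continuing from the fit_squid object in the earlier sketch (the object name is an assumption of that sketch):

```r
# All standard influence measures: DFBETAS, DFFITS, COVRATIO, Cook's distance, hat values
inf <- influence.measures(fit_squid)
summary(inf)               # observations flagged as potentially influential

rstudent(fit_squid)        # R-student residuals
hatvalues(fit_squid)       # leverages; a common rule of thumb flags values above 2p/n
cooks.distance(fit_squid)  # Cook's distance
```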
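For Questions 2–3, the car package covers all three displays; reading car::influencePlot() as the bubble plot is an assumption of this sketch:

```r
library(car)

influencePlot(fit_squid)       # bubble plot: R-student vs hat values, circle area ~ Cook's distance
influenceIndexPlot(fit_squid)  # index plots of Cook's distance, R-student, p-values, hat values
avPlots(fit_squid)             # added-variable plot for each regressor x1-x5
```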
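For the (MSSC) question, the sketch below verifies the three properties numerically for \(x_1\); the object names squid and fit_squid are assumptions carried over from the earlier sketches.

```r
# e(y | x_(1)):  residuals of y regressed on every regressor except x1
# e(x1 | x_(1)): residuals of x1 regressed on every regressor except x1
e_y  <- resid(lm(y  ~ x2 + x3 + x4 + x5, data = squid))
e_x1 <- resid(lm(x1 ~ x2 + x3 + x4 + x5, data = squid))

av_fit <- lm(e_y ~ e_x1)   # added-variable (partial) regression

# (1) slope of the added-variable regression equals b1 from the full fit
c(av_slope = unname(coef(av_fit)["e_x1"]),
  full_b1  = unname(coef(fit_squid)["x1"]))

# (2) residuals of the added-variable regression equal the full-model residuals
all.equal(unname(resid(av_fit)), unname(resid(fit_squid)))

# (3) se(b1) = s / sqrt(sum of squared e(x1 | x_(1))), with s = sqrt(MS_res) from the full model
s <- summary(fit_squid)$sigma
c(formula_se = s / sqrt(sum(e_x1^2)),
  full_se    = summary(fit_squid)$coefficients["x1", "Std. Error"])
```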