Homework 2 - Simple linear regression

Due Friday, September 22, 11:59 PM on D2L

Homework Instruction and Requirement

  • Homework 2 covers course material from Weeks 1 to 4.

  • Please submit your work as a single PDF file, including all parts, to D2L > Assessments > Dropbox. Multiple files or files that are not in PDF format are not accepted.

  • In your homework, please number and answer questions in order.

  • Your answers to the Mathematical Derivation and Reasoning part may be handwritten. However, you need to scan your pages and convert them to a PDF file.

  • Your entire work on Statistical Computing and Data Analysis should be completed using word processing software (Microsoft Word, Google Docs, (R)Markdown, Quarto, LaTeX, etc.) and your preferred programming language. Your document should be a PDF file.

  • Questions starting with (MSSC) are for MSSC 5780 students. MATH 4780 students may earn extra points from them.

  • It is your responsibility to make clear what you are trying to show. If you type your answers, make sure there are no typos. I grade your work based on what you show, not what you want to show. If you choose to handwrite your answers, write them neatly. If I can’t read your handwriting, your answer is judged as wrong.

Mathematical Derivation and Reasoning

The following questions are based on the population and sample linear regression model defined in our course slides and textbook.

  1. (MSSC) Find the least squares estimators \(b_0\) and \(b_1\) such that \[(b_0, b_1) = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^n(y_i - \beta_0 - \beta_1x_i)^2.\]

  2. (MSSC) Remember that before the training sample is collected, the \(y_i\) are assumed to be random variables. Show that \(b_0\) is an unbiased estimator for \(\beta_0\), i.e., \(E(b_0) = \beta_0\).

  3. (MSSC) Show that \(\sum_{i=1}^n(y_i - \overline{y})^2 = \sum_{i=1}^n(\hat{y}_i - \overline{y})^2 + \sum_{i=1}^n(y_i - \hat{y}_i)^2\), i.e., \(SS_T = SS_R + SS_{res}\).

Statistical Computing and Data Analysis

Please perform a data analysis using \(\texttt{R}\) or your preferred language. All results should come from computer output, and your work should be done entirely on a computer. Handwriting is not allowed. Attach the relevant code.

Data Analysis

The data set mpg.csv presents data on the gasoline mileage performance of 32 different automobiles (Table B.3 in the textbook LRA).

To import the data set into your R session, use read.csv(), for example:

data_name_you_like <- read.csv("the_path_that_saves_your_data/mpg.csv")

Once the data set is loaded, type its name in the R console. The data should be a data frame with 32 rows and 12 columns whose first six rows look like

      y  x1  x2  x3   x4   x5 x6 x7    x8   x9  x10 x11
1 18.90 350 165 260 8.00 2.56  4  3 200.3 69.9 3910   1
2 17.00 350 170 275 8.50 2.56  4  3 199.6 72.9 3860   1
3 20.00 250 105 185 8.25 2.73  1  3 196.7 72.2 3510   1
4 18.25 351 143 255 8.00 3.00  2  3 199.9 74.0 3890   1
5 20.07 225  95 170 8.40 2.76  1  3 194.1 71.8 3365   0
6 11.20 440 215 330 8.20 2.88  4  3 184.5 69.0 4215   1

If this is what you get, you are good to start!
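A quick structural check can confirm the import before you start the analysis. The sketch below is only an illustration: because the real file path is not known here, it writes the six rows shown above to a temporary file and reads that back. This stand-in assumes mpg.csv has a header row with the column names y, x1, ..., x11, and `mpg_data` is a placeholder name.

```r
# Stand-in for mpg.csv: only the six rows printed above (assumed header row).
csv_text <- "y,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11
18.90,350,165,260,8.00,2.56,4,3,200.3,69.9,3910,1
17.00,350,170,275,8.50,2.56,4,3,199.6,72.9,3860,1
20.00,250,105,185,8.25,2.73,1,3,196.7,72.2,3510,1
18.25,351,143,255,8.00,3.00,2,3,199.9,74.0,3890,1
20.07,225,95,170,8.40,2.76,1,3,194.1,71.8,3365,0
11.20,440,215,330,8.20,2.88,4,3,184.5,69.0,4215,1"

tmp <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp)

# With the real file, replace `tmp` by the path to your copy of mpg.csv.
mpg_data <- read.csv(tmp)

dim(mpg_data)    # rows, columns; the full data set should report 32 and 12
names(mpg_data)  # column names
str(mpg_data)    # storage type of each column
head(mpg_data)   # first six rows
```

With the full file, `dim()` reporting 32 rows and 12 columns and `head()` matching the rows above indicates the import worked.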

The variables are

  • \(y\): Miles per gallon
  • \(x_1\): Displacement (cubic in.)
  • \(x_2\): Horsepower (ft-lb)
  • \(x_3\): Torque (ft-lb)
  • \(x_4\): Compression ratio
  • \(x_5\): Rear axle ratio
  • \(x_6\): Carburetor (barrels)
  • \(x_7\): No. of transmission speeds
  • \(x_8\): Overall length (in.)
  • \(x_9\): Width (in.)
  • \(x_{10}\): Weight (lb)
  • \(x_{11}\): Type of transmission (A automatic; M manual)

  1. Fit a simple linear regression model relating gasoline mileage \(y\) (miles per gallon) to engine displacement \(x_1\) (cubic inches). Interpret your coefficients. Are there any potential concerns?

  2. Provide the \(95\%\) CI for \(\beta_0\), \(\beta_1\) and \(\sigma^2\).

  3. With \(\alpha = 0.05\), test whether \(\beta_1\) is significantly different from 0. Show your procedure, including \(H_0\) and \(H_1\), the test statistic or \(p\)-value, and the decision rule.

  4. Construct the ANOVA table and test for significance of regression.

  5. What percent of the total variability in gasoline mileage is accounted for by the linear relationship with engine displacement?

  6. Find a \(95\%\) CI on the mean gasoline mileage if the engine displacement is 275 in\(^3\).

  7. Suppose that we wish to predict the gasoline mileage obtained from a car with a 275 in\(^3\) engine. Give a point estimate of mileage. Find a \(95\%\) prediction interval (PI) on the mileage. Compare the PI with the CI in question 6 and explain the difference between them. Which one is wider, and why?

  8. Plot the data \(\{(x_{1i}, y_i)\}_{i=1}^{32}\), the fitted regression line, the CI for \(\mu_{Y|X}\), and the PI for \(y\) in one figure. Add appropriate axis labels, a title, and a legend. [Hint: Create a sequence of values of \(x\), and obtain the CI and PI for each value of \(x\). Use legend() to add a legend to a plot.]

  9. Use the data and your fitted result to verify that

      1. \(\scriptstyle \sum_{i=1}^{32}(y_i - \hat{y}_i) = \sum_{i=1}^{32}e_i = 0\)
      2. \(\scriptstyle \sum_{i=1}^{32}y_i = \sum_{i=1}^{32}\hat{y}_i\)
      3. The LS regression line passes through the centroid \((\overline{x}, \overline{y})\)
      4. \(\scriptstyle \sum_{i=1}^{32}x_ie_i = 0\)
      5. \(\scriptstyle \sum_{i=1}^{32}\hat{y}_ie_i = 0\)
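The hint in question 8 can be sketched as follows. This is a minimal illustration, not a solution: it uses R's built-in `cars` data (speed and stopping distance) instead of mpg.csv, and the object names `fit`, `x_grid`, `ci_band`, and `pi_band` are placeholders. It relies on `predict()` for `lm` objects, which returns the columns `fit`, `lwr`, and `upr` when `interval` is set.

```r
# Built-in `cars` data used as a stand-in; this is NOT the homework data.
fit <- lm(dist ~ speed, data = cars)

# Sequence of predictor values spanning the observed range
x_grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed),
                                 length.out = 100))

# Interval for the mean response (CI) and for a new observation (PI)
ci_band <- predict(fit, newdata = x_grid, interval = "confidence", level = 0.95)
pi_band <- predict(fit, newdata = x_grid, interval = "prediction", level = 0.95)

# Scatter plot, then overlay the fitted line and both bands
plot(cars$speed, cars$dist, pch = 16,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "Fitted line with 95% CI and PI",
     ylim = range(pi_band, cars$dist))
lines(x_grid$speed, ci_band[, "fit"], lwd = 2)
lines(x_grid$speed, ci_band[, "lwr"], lty = 2, col = "blue")
lines(x_grid$speed, ci_band[, "upr"], lty = 2, col = "blue")
lines(x_grid$speed, pi_band[, "lwr"], lty = 3, col = "red")
lines(x_grid$speed, pi_band[, "upr"], lty = 3, col = "red")
legend("topleft",
       legend = c("Fitted line", "95% CI for mean", "95% PI"),
       lty = c(1, 2, 3), lwd = c(2, 1, 1),
       col = c("black", "blue", "red"))

# Floating-point sums are rarely exactly zero, so compare with a tolerance
e <- resid(fit)
abs(sum(e)) < 1e-8
```

For the identities in question 9, expect computed sums on the order of machine precision rather than exactly 0, so a tolerance check like the last line is a reasonable way to report them.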