MATH 4780 - Fall 2023 - Overview of Regression 📖

Relationship as Functions

Represent relationships between variables using functions $y = f(x)$.
- Plug in the inputs and receive the output.
- $y = f(x) = 3x + 7$ is a function with input $x$ and output $y$.
- If $x = 5$, $y = 3 \times 5 + 7 = 22$.
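
As a minimal sketch, the function above can be written in code (the name `f` simply mirrors the notation on the slide):

```python
def f(x):
    """Deterministic relationship from the slide: y = 3x + 7."""
    return 3 * x + 7

# The same input always produces the same output.
for x in [0, 1, 5]:
    print(f"x = {x}, y = {f(x)}")
```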

Different Relationships

Can you come up with any real-world examples describing relationships between variables deterministically?

Relationship between Variables is Not Perfect

Can you provide some real examples in which variables are related to each other, but not perfectly?

Unfortunately, in reality, most relationships between the variables we are interested in are not perfect.
In fact, there is almost no perfect relationship between two variables, because everything in the world is connected, so any two variables are also affected by other variables.
For example, the displacement of an object is also affected by airflow and humidity, which are not considered in the formula.
Even when two variables are perfectly related, there is always some measurement error or noise when we record and collect the data. For example, there may be measurement errors when we measure the displacement of an object at a given time. The quadratic relationship is there, but the data we collect do not fall exactly on the quadratic curve.
That is why the relationship between variables always involves some uncertainty.
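
The measurement-error idea can be illustrated with a small simulation. This is only a sketch: the free-fall law $s(t) = \frac{1}{2}gt^2$, the value of `G`, and the error standard deviation of 0.3 are assumptions chosen for illustration, not values from the lecture.

```python
import random

G = 9.8  # assumed gravitational acceleration (m/s^2); not from the slides

def true_displacement(t):
    """Hypothetical deterministic law: s(t) = 0.5 * g * t^2."""
    return 0.5 * G * t ** 2

rng = random.Random(4780)  # fixed seed so the run is reproducible
times = [0.5, 1.0, 1.5, 2.0, 2.5]

# Each recorded value is the true value plus a random measurement error.
measured = [true_displacement(t) + rng.gauss(0, 0.3) for t in times]

for t, m in zip(times, measured):
    print(f"t = {t:.1f}  true = {true_displacement(t):6.2f}  measured = {m:6.2f}")
```

The quadratic relationship still drives the data, but the measured values scatter around the curve instead of lying on it.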

Relationship between Variables is Not Perfect

💵 In general, a person with more years of education earns more.
💵 Any two people with the same years of education may have different annual incomes.

Variation around the Function/Model

Where does the unexplained variation come from?

Other factors that account for part of the variability in income.
- Adding more explanatory variables to a model can reduce the size of the variation around the model.
Pure measurement error.
Pure randomness, which can play a big role. 🤔
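
A rough simulation can show how an extra explanatory variable shrinks the variation around the model. Everything here is hypothetical: the coefficients, the GPA range, and the noise level are made up, and the known generating functions are used directly instead of fitted models to keep the sketch short.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed for reproducibility
n = 500

# Hypothetical data-generating process (all coefficients are made up):
# income = 20 + 3 * education + 5 * GPA + noise   (in thousands of dollars)
education = [rng.randint(10, 20) for _ in range(n)]
gpa = [rng.uniform(2.0, 4.0) for _ in range(n)]
noise = [rng.gauss(0, 2.0) for _ in range(n)]
income = [20 + 3 * e + 5 * g + u for e, g, u in zip(education, gpa, noise)]

# Variation around a function of education only (GPA held at its mean, 3.0).
resid_educ_only = [y - (20 + 3 * e + 5 * 3.0) for y, e in zip(income, education)]

# Variation around a function of both education and GPA.
resid_both = [y - (20 + 3 * e + 5 * g) for y, e, g in zip(income, education, gpa)]

sd_educ_only = statistics.stdev(resid_educ_only)
sd_both = statistics.stdev(resid_both)
print(f"SD around f(education):      {sd_educ_only:.2f}")
print(f"SD around f(education, GPA): {sd_both:.2f}")
```

The second spread is smaller because GPA now explains part of what previously looked like random variation. What remains is the noise term.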

What other factors (variables) may affect a person’s income?

your income = f(years of education, major, GPA, college, parent's income, ...)

Regression Model

$Y$: response, outcome, label, dependent variable, e.g., income
$X$: predictor, covariate, feature, regressor, explanatory/independent variable, e.g., years of education, which is treated as known and fixed.

Explain the relationship between $X$ and $Y$ and make predictions through a model \[Y = f(X) + \epsilon\]
$\epsilon$: irreducible random error
- independent of $X$
- mean zero with some variance.
$f(\cdot)$: fixed but unknown function describing the relationship between $X$ and the mean of $Y$.

In Intro Stats, what is the form of $f$ and what assumptions did you make about the random error $\epsilon$?

$f(X) = \beta_0 + \beta_1X$ with unknown parameters $\beta_0$ and $\beta_1$.
$\epsilon \sim N(0, \sigma^2)$.
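
To make the Intro Stats model concrete, here is a minimal simulation sketch. The values of `beta0`, `beta1`, and `sigma` are hypothetical, chosen only so the output is easy to read.

```python
import random
import statistics

rng = random.Random(2023)  # fixed seed for reproducibility
beta0, beta1, sigma = 10.0, 4.0, 2.0  # hypothetical "true" parameter values

x = [rng.uniform(8, 20) for _ in range(1000)]  # e.g. years of education
eps = [rng.gauss(0, sigma) for _ in x]          # epsilon ~ N(0, sigma^2)
y = [beta0 + beta1 * xi + ei for xi, ei in zip(x, eps)]

# The errors should average out near zero with spread near sigma.
print(f"mean of errors: {statistics.mean(eps):.3f}")
print(f"SD of errors:   {statistics.stdev(eps):.3f}")
```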

Now, after collecting data on the variables we are interested in, we know that their relationship is, most of the time, not perfect but stochastic in some way.
How do we model such a stochastic relationship? The answer is a regression model.
Suppose we are interested in the relationship between two variables, call them $X$ and $Y$. In particular, we would like to know how changes in $X$ affect the value of $Y$, or we want to use $X$ to predict $Y$.
In this sense, $Y$ is called the response, outcome, label, or dependent variable, e.g., income.
$X$ is called the predictor, covariate, feature, regressor, or explanatory or independent variable, e.g., years of education, and is treated as known and fixed.
We explain the relationship between $X$ and $Y$ and make predictions through a model $Y = f(X) + \epsilon$. This is a very general regression model we can build to learn the relationship between $X$ and $Y$.
$f(\cdot)$ is fixed but unknown and describes the true relationship between $X$ and the mean of $Y$.
$\epsilon$ is an irreducible random error, which is assumed to be independent of $X$ and to have mean zero with some variance.
$\epsilon$ represents measurement errors and the variation that cannot be explained or captured by the predictor $X$.
Intro Stats:
- $f(X) = \beta_0 + \beta_1X$ with unknown parameters $\beta_0$ and $\beta_1$.
- $\epsilon \sim N(0, \sigma^2)$.
Here $X$ and $Y$ are assumed to be linearly related, which may not be correct.
Next week, we will learn simple linear regression from scratch and in much more detail. Here I just give you an overview.

True Unknown Function $f$ of the Model $Y = f(X) + \epsilon$

Blue curve: true underlying relationship between (the mean) income and years of education.
Black lines: error associated with each observation

Big problem: $f(x)$ is unknown and needs to be estimated.

Why Estimate $f$? Prediction for $Y$

Prediction: Inputs $X$ are available, but the output $Y$ cannot be easily obtained. We predict $Y$ using \[ \hat{Y} = \hat{f}(X), \] where $\hat{f}$ is our estimate of $f$, and $\hat{Y}$ represents the resulting prediction for $Y$.

Now we know our goal is to estimate the unknown regression function $f$. But why do we need that? What do we gain from estimating $f$?
The two main benefits of doing regression are:
- first, we are able to predict $Y$ given a value of $X$;
- second, we learn how $X$ affects $Y$, which may be our research interest.
Let’s discuss prediction first.
Regression can be a great tool for prediction, especially when the inputs $X$ are available but the output $Y$ cannot be easily obtained.
For example, we usually know people’s years of education, but we rarely know their income level, because income is private, personal information.
After we estimate $f$, we can use the estimate $\hat{f}$ to predict $Y$.
So we predict $Y$ using the relationship between $X$ and $Y$ that we learn from our data, that is, $\hat{Y} = \hat{f}(X)$.

In Intro Stats, what is our estimated regression function $\hat{f}$?

$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1X$.
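
A minimal sketch of how $\hat{\beta}_0$ and $\hat{\beta}_1$ might be computed and then used for prediction. The helper name `fit_simple_ols` and the five data points are made up for illustration; the formulas are the standard closed-form least squares estimates for a single predictor.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Made-up training data: years of education and income (in thousands).
years = [10, 12, 14, 16, 18]
income = [30, 38, 45, 56, 61]

b0_hat, b1_hat = fit_simple_ols(years, income)

def f_hat(x):
    """Estimated regression function: y_hat = b0_hat + b1_hat * x."""
    return b0_hat + b1_hat * x

print(f"b0_hat = {b0_hat:.2f}, b1_hat = {b1_hat:.2f}")
print(f"predicted income at 15 years: {f_hat(15):.2f}")
```

For prediction, only the output of `f_hat` matters, which is why the estimated function can be treated as a black box.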

$\hat{f}$ is often treated as a black box.

Why Estimate $f$? Inference for $f$

Inference: Understand how $Y$ is affected by $X$.
$\hat{f}$ cannot be treated as a black box. We want to know the exact form of $f$.

We are interested in

Which covariates are associated with the response?
👉 Do age, education level, gender, etc. affect salary?

What is the relationship between the response and each covariate?
👉 How much does salary increase or decrease as age increases by one unit?

Can the relationship be adequately summarized using any equation?
👉 Is the relationship between salary and age linear, quadratic, or more complicated?

Another benefit of estimating $f$ is that we can do inference for the regression function $f$.
We estimate $f$ so that we understand how $Y$ is affected as $X$ changes.
When the goal is inference, $\hat{f}$ cannot be treated as a black box.
We want to know the exact form of $f$.
When doing inference, we are interested in:
- Which covariates are associated with the response? e.g., do age, education level, gender, etc. affect salary?
- What is the relationship between the response and each covariate? e.g., how much does salary increase or decrease as age increases by one unit?
- Can the relationship between $Y$ and each covariate be adequately summarized using a linear equation, or is the relationship more complicated? e.g., is the relationship between salary and age linear, quadratic, or more complicated?

All those questions may be your research questions, and these can be answered by regression analysis.

How to Estimate $f$?

Observations $\{(x_1, y_1), (x_2, y_2), \dots, (x_n,y_n)\}$: training data to train or teach our model to learn $f$.
Use test data to test or evaluate how well the model makes inference or prediction.

Models are characterized as either parametric or nonparametric.

Parametric methods involve a two-step model-based approach:
- 1️⃣ Make an assumption about the functional form, or shape of $f$, e.g. linear regression \[ f(X) = \beta_0 + \beta_1X_1 \]
- 2️⃣ Use the training data to fit or train the model, e.g., estimate the parameters $\beta_0, \beta_1$ using (ordinary) least squares.

Generally, regression models can be characterized as either parametric or nonparametric.
Parametric methods involve a two-step model-based approach:
- [1] First, we make an assumption about the functional form, or shape, of $f$. For example, we can assume that the relationship between $Y$ and the predictors is linear, so $f(X)$ is the linear function $f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p$.
- [2] Then we use the training data to fit or train the model. For example, we estimate the parameters $\beta_0, \beta_1, \beta_2, \dots, \beta_p$ using (ordinary) least squares.
Nonparametric methods, on the other hand, do not make assumptions about the functional form of $f$.
They seek an estimate of $f$ that gets as close to the data points as possible without being too rough or wiggly. The idea is to make the function follow the scatter pattern of the data, so that the function is a smoothed version of the data.

Nonparametric methods do not make assumptions about the shape of $f$.
- Seek an estimate of $f$ that gets close to the data points without being too rough or wiggly.

Parametric vs. Nonparametric Models

Parametric (Linear regression)

Nonparametric (LOESS)

Parametric vs. Nonparametric Models

Parametric: Assumptions on $f$ with unknown parameters.
Nonparametric: No assumptions on $f$ (may have no closed form).

Linear vs. Nonlinear Models

Linear Regression: $Y$ is linear in unknown parameters, NOT predictors.
- $Y = \beta_0 + \beta_1X + \epsilon$
- $Y = \beta_0 + \beta_1X + \beta_2X^2 + \epsilon$
- $Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3\sqrt{X} + \epsilon$

A nonlinear relationship between $X$ and $Y$ can be modeled using a linear regression.
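
As a small illustration of this point, the sketch below fits the curved model $Y = \beta_0 + \beta_1X^2 + \epsilon$ with ordinary least squares by treating $Z = X^2$ as the predictor. The data are made up and noise-free so the recovered coefficients are easy to check, and `fit_simple_ols` is a hypothetical helper name.

```python
def fit_simple_ols(x, y):
    """Closed-form least squares for one predictor: returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

xs = [-2, -1, 0, 1, 2]
ys = [13, 4, 1, 4, 13]  # generated as 1 + 3 * x^2 with no noise (made up)

# The model is curved in X but linear in the parameters, so we can
# regress Y on the transformed predictor Z = X^2.
zs = [x ** 2 for x in xs]
b0_hat, b1_hat = fit_simple_ols(zs, ys)
print(f"b0_hat = {b0_hat:.2f}, b1_hat = {b1_hat:.2f}")
```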

Nonlinear Regression: $Y$ is NOT linear in unknown parameters.
- $Y = \frac{\beta_0}{1 + e ^{-\beta_1X}} + \epsilon$
- $Y = \beta_0e ^{-\beta_1X}\cdot\epsilon$

👉 Some nonlinear models can be transformed to an equivalent linear model.

Which nonlinear model above can be transformed into a linear model?
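
One way to work this out, assuming $Y$, $\beta_0$, and $\epsilon$ are all positive (an assumption not stated on the slide), is to take logs of the second model:

\[ \ln Y = \ln \beta_0 - \beta_1 X + \ln \epsilon \]

Writing $Y' = \ln Y$, $\beta_0' = \ln \beta_0$, $\beta_1' = -\beta_1$, and $\epsilon' = \ln \epsilon$ gives $Y' = \beta_0' + \beta_1' X + \epsilon'$, which is linear in the new parameters. The first model has an additive error and an unknown $\beta_0$ in the numerator, so it does not appear to have a comparably simple transformation.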
