MATH 4780 / MSSC 5780 Regression Analysis
Classification examples: Normal vs. COVID vs. Smoking; fake news vs. true news.

Predict whether a person will default (yes or no), based on monthly credit card balance \((X)\). \[Y =\begin{cases} 0 & \quad \text{if not default}\\ 1 & \quad \text{if default} \end{cases}\]
What is the problem with this dummy variable approach?
We instead model the probability of default using an S-shaped curve.

What is a Bernoulli trial?
In the credit card example,
\(y_i \mid x_i \stackrel{indep}{\sim} Bernoulli(\pi(x_i)) = binomial(m=1,\pi = \pi(x_i))\)
The success probability \(\pi\) depends on the balance: \(x_1 = 2000\) has a larger \(\pi_1 = \pi(2000)\) than \(x_2 = 500\) has \(\pi_2 = \pi(500)\), because credit cards with a higher balance tend to default.

Instead of predicting \(y_i\) directly, we use the predictors to model its probability of success, \(\pi_i\).
But how?
\[\eta = logit(\pi) = \ln\left(\frac{\pi}{1-\pi}\right)\]
For \(i = 1, \dots, n\) and with one predictor \(X\): \[(Y_i \mid X = x_i) \stackrel{indep}{\sim} Bernoulli(\pi(x_i))\] \[\text{logit}(\pi_i) = \ln \left( \frac{\pi(x_i)}{1 - \pi(x_i)} \right) = \eta_i = \beta_0+\beta_1 x_{i}\]
Once we get the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\), \[\small \hat{\pi}_i = \frac{\exp(\hat{\beta}_0+\hat{\beta}_1 x_{i} )}{1+\exp(\hat{\beta}_0+\hat{\beta}_1 x_{i})} = \frac{1}{1+\exp(-\hat{\beta}_0-\hat{\beta}_1 x_{i})}\]
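In R, the logit link and its inverse are built into base R as qlogis() and plogis(); a minimal sketch (the value 0.8 is an arbitrary illustrative probability):

```r
p   <- 0.8                 # an arbitrary success probability
eta <- qlogis(p)           # logit: log(p / (1 - p)) = log(4)
plogis(eta)                # inverse logit recovers p = 0.8
# plogis(eta) = exp(eta) / (1 + exp(eta)) = 1 / (1 + exp(-eta))
```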
default student balance income
1 No No 730 44362
2 No Yes 817 12106
3 No No 1074 31767
4 No No 529 35704
5 No No 786 38463
6 No Yes 920 7492
7 No No 826 24905
8 No Yes 809 17600
9 No No 1161 37469
10 No No 0 29275
str(Default)
'data.frame': 10000 obs. of 4 variables:
$ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
$ balance: num 730 817 1074 529 786 ...
$ income : num 44362 12106 31767 35704 38463 ...
# A tibble: 2 × 2
default avg_balance
<fct> <dbl>
1 No 804.
2 Yes 1748.
logit_fit <- glm(default ~ balance, data = Default, family = binomial)
summ_logit_fit <- summary(logit_fit)
summ_logit_fit$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.6513 0.36116 -29.5 3.62e-191
balance 0.0055 0.00022 25.0 1.98e-137
The ratio \(\frac{\pi}{1-\pi} \in (0, \infty)\) is called the odds of some event.
\[\ln \left( \frac{\pi(x)}{1 - \pi(x)} \right)= \beta_0 + \beta_1x\] - Increasing \(x\) by one unit changes the log-odds by \(\beta_1\), or it multiplies the odds by \(e^{\beta_1}\).
Increasing balance by one unit increases the log odds of default by 0.0055 units.

The odds ratio is \(\widehat{OR} = \frac{\text{odds}_{x+1}}{\text{odds}_{x}} = e^{\hat{\beta}_1} = e^{0.0055} = 1.005515\).

The odds of default increase by about 0.55% with each additional unit of credit card balance.
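A quick base-R check of this interpretation, using the rounded estimates from the output above (with the fitted object, exp(coef(logit_fit)) gives the same quantity):

```r
b1 <- 0.0055                 # rounded slope estimate from the summary above
OR <- exp(b1)                # estimated odds ratio for a one-unit increase
OR                           # about 1.0055
(OR - 1) * 100               # about a 0.55% increase in the odds

# The odds at balance x + 1 equal the odds at x times exp(b1):
b0 <- -10.6513               # rounded intercept estimate
all.equal(exp(b0 + b1 * 2001), exp(b0 + b1 * 2000) * OR)   # TRUE
```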
\[\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -10.651+0.0055\times 2000\]
\[ \hat{\pi} = \frac{1}{1+\exp(-(-10.651+0.0055 \times 2000))} = 0.586\]
pi_hat <- predict(logit_fit, type = "response")
eta_hat <- predict(logit_fit, type = "link") ## default gives us b0 + b1*x
predict(logit_fit, newdata = data.frame(balance = 2000), type = "response")
1
0.586
What is the probability of default when the balance is 500? What about balance 2500?
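A sketch for checking these by hand, using the rounded estimates printed above (so the numbers are approximate; predict() on the fitted model is the exact route):

```r
# Rounded coefficient estimates from the summary output above
b0 <- -10.6513
b1 <- 0.0055

# Inverse logit: pi(x) = 1 / (1 + exp(-(b0 + b1 * x)))
pi_hat <- function(balance) plogis(b0 + b1 * balance)

pi_hat(500)    # near zero: low balances rarely default
pi_hat(2000)   # about 0.586, matching the calculation above
pi_hat(2500)   # close to 1
```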
For \(i = 1, \dots, n\) and with \(k\) predictors: \[(Y_i \mid {\bf x}_i) \stackrel{indep}{\sim} \text{Bernoulli}(\pi_i), \quad \pi_i = \pi({\bf x}_i), \quad {\bf x}_i' = (x_{i1}, \dots, x_{ik})\] \[\text{logit}(\pi_i) = \ln \left( \frac{\pi_i}{1 - \pi_i} \right) = \eta_i = \beta_0+\beta_1 x_{i1} + \cdots + \beta_k x_{ik} = {\bf x}_i'\boldsymbol \beta\]
\[\small E(Y_i) = \pi_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} = \frac{\exp(\beta_0+\beta_1 x_{i1} + \cdots + \beta_k x_{ik})}{1+\exp(\beta_0+\beta_1 x_{i1} + \cdots + \beta_k x_{ik})} = \frac{\exp( {\bf x}_i'\boldsymbol \beta)}{1 + \exp({\bf x}_i'\boldsymbol \beta)}\] \[\small \hat{\pi}_i = \frac{\exp(\hat{\beta}_0+\hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_k x_{ik})}{1+\exp(\hat{\beta}_0+\hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_k x_{ik})}\]
multi_logit_fit <- glm(default ~ balance + I(income/1000), data = Default,
family = binomial)
summ_multi_logit_fit <- summary(multi_logit_fit)
summ_multi_logit_fit$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.54047 0.434756 -26.54 2.96e-155
balance 0.00565 0.000227 24.84 3.64e-136
I(income/1000) 0.02081 0.004985 4.17 2.99e-05
\(\hat{\eta} = \text{logit}(\hat{\pi}) = \ln \left( \frac{\hat{\pi}}{1 - \hat{\pi}}\right) = -11.54 + 0.0056 \times \text{balance} + 0.021 \times \text{income}\)
One with a credit card balance of $1,500 and an income of $40,000 has an estimated probability of default of \[\hat{\pi} = \frac{1}{1+ \exp(-(-11.54 + 0.0056(1500) + 0.021(40)))} = 0.091\]
Why do we multiply the income coefficient by 40, rather than 40,000?
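Because income enters the model as I(income/1000), an income of $40,000 corresponds to the value 40. A base-R sketch of the calculation above, using the rounded estimates (numbers are approximate):

```r
# Rounded estimates from the summary output above
b0    <- -11.54
b_bal <- 0.0056
b_inc <- 0.021    # per $1,000 of income, since the model uses income/1000

eta <- b0 + b_bal * 1500 + b_inc * 40   # income of $40,000 enters as 40
plogis(eta)                             # about 0.091
```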
So far we have considered only one binary outcome for each combination of predictors.
Often we have repeated observations or trials at each level of the regressors.
Originally, at \(X = x_i\), we have one single observation \(y_i = 0\) or \(1\).
Now, at \(X = x_i\), we have \(m_i\) trials and \(y_i\) of them are ones (successes).
Let \(y_{i,j}\) be an indicator (Bernoulli) variable taking value \(0\) or \(1\) for the \(j\)-th trial at \(x_i\).
\(y_i = y_{i,1} + y_{i,2} + \cdots + y_{i, m_i} = \sum_{j=1}^{m_i} y_{i,j}\).
What is our response, and what distribution can we use to model it?
The responses \(Y_i \sim binomial(m_i, \pi_i)\).
Assume \(Y_1, Y_2, \dots, Y_n\) are independent. ( \(m_i\) trials, \(y_{i, 1}, \dots, y_{i, m_i}\), are independent too by the definition of a binomial experiment )
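A small simulation sketch of this construction (the seed and the values of m and p are arbitrary choices): summing \(m_i\) independent Bernoulli trials yields one binomial count.

```r
set.seed(1)     # arbitrary seed, for reproducibility
m <- 10         # number of trials at a given x_i
p <- 0.3        # success probability pi(x_i) at that level

# The m Bernoulli indicators y_{i,j} for one level of x
y_ij <- rbinom(m, size = 1, prob = p)

# Their sum y_i has the same distribution as one binomial(m, p) draw
y_i <- sum(y_ij)
y_i             # an integer between 0 and m
```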
Number of trials | Number of successes | Regressors |
---|---|---|
\(m_1\) | \(y_1\) | \(x_{11}, x_{12}, \dots, x_{1k}\) |
\(m_2\) | \(y_2\) | \(x_{21}, x_{22}, \dots, x_{2k}\) |
\(\vdots\) | \(\vdots\) | \(\vdots\) |
\(m_n\) | \(y_n\) | \(x_{n1}, x_{n2}, \dots, x_{nk}\) |
The compressive strength of an alloy fastener used in aircraft construction is being studied.
Ten loads were selected over the range 2500 – 4300 psi and a number of fasteners were tested at those loads.
The numbers of fasteners failing at each load were recorded.
binom_fit <- glm(cbind(y, m - y) ~ load,
data = fastener, family = binomial)
binom_summ <- summary(binom_fit)
binom_summ$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.33971 0.545693 -9.79 1.30e-22
load 0.00155 0.000158 9.83 8.45e-23
\[\hat{\pi} = \frac{e^{-5.34 + 0.0015x}}{1 + e^{-5.34 + 0.0015x}}\]
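For example, using the rounded coefficients above at an illustrative load of 3000 psi (a value inside the tested 2500 – 4300 psi range):

```r
# Rounded estimates from the summary output above
b0 <- -5.34
b1 <- 0.0015

# Estimated probability that a fastener fails at a 3000 psi load
plogis(b0 + b1 * 3000)   # about 0.30 with these rounded values
```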
 | Truth 0 | Truth 1 |
---|---|---|
Labeled 0 | True Negative (TN) | False Negative (FN) |
Labeled 1 | False Positive (FP) | True Positive (TP) |
Sensitivity (True Positive Rate) \(= P( \text{Labeled 1} \mid \text{1}) = \frac{TP}{TP+FN}\)
Specificity (True Negative Rate) \(= P( \text{Labeled 0} \mid \text{0}) = \frac{TN}{FP+TN}\)
Accuracy \(= \frac{TP + TN}{TP+FN+FP+TN}\)
More on the Wikipedia page.
No Yes
FALSE 9625 233
TRUE 42 100
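Computing the three metrics by hand from this confusion matrix (counts taken directly from the table above):

```r
# Counts from the confusion matrix above (rows: predicted, columns: truth)
TN <- 9625; FN <- 233; FP <- 42; TP <- 100

sensitivity <- TP / (TP + FN)                    # true positive rate
specificity <- TN / (FP + TN)                    # true negative rate
accuracy    <- (TP + TN) / (TP + FN + FP + TN)

c(sensitivity, specificity, accuracy)
# about 0.30, 0.996, and 0.9725
```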
Alternatively, caret::confusionMatrix() or yardstick::conf_mat() compute the confusion matrix and related metrics.
library(ROCR)
# predicted probabilities from the logistic fit
pred_prob <- predict(logit_fit, type = "response")
# create an object of class prediction
pred <- ROCR::prediction(
    predictions = pred_prob,
    labels = Default$default)
# calculates the ROC curve
roc <- ROCR::performance(
prediction.obj = pred,
measure = "tpr",
x.measure = "fpr")
plot(roc, colorize = TRUE)
Find the area under the curve:
## object of class 'performance'
auc <- ROCR::performance(
prediction.obj = pred,
measure = "auc")
auc@y.values
[[1]]
[1] 0.948
Which model performs better?
Remember! Compare the candidates using the test data.