Simulation-based Inference 💻

MATH 4780 / MSSC 5780 Regression Analysis

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

Bootstrapping

Idea of Bootstrapping

Bootstrap Confidence Interval

Bootstrap Confidence Interval for Linear Regression

Randomization test for Linear Regression*

Why Simulation-based Inference?

Let’s go back to MATH 4720. How do we do interval estimation for population mean $\mu$?

$z$-interval when $\sigma$ is known
Student’s $t$-interval when $\sigma$ is unknown

What are the assumptions in orderto use $z$ or $t$ intervals?

The population is Gaussian. If not, the sample size is large enough so that the central limit theorem can be applied!

We never answer this question in MATH 4720. What if the population is not Gaussian and the sample size $n$ is small, i.e., CLT becomes powerless?

Bootstrapping 👢

The term bootstrapping comes from the phrase

pulling oneself up by one’s bootstraps

which is a metaphor for accomplishing an impossible task without any outside help.

Impossible task: estimating a population parameter using data from only the given single sample dataset, without CI formula and replicates of data.

Source: http://dailywhiteboard.blogspot.com/2014/04/day-251-pull-yourself-up-by-your.html

Note: Notion of saying something about a population parameter using only information from an observed sample is the crux of statistical inference, it is not limited to bootstrapping.
The term bootstrapping comes from the phrase “pulling oneself up by one’s bootstraps”, which is a metaphor for accomplishing an impossible task without any outside help.
In statistics scenario, it means that we are estimating a population parameter using data from only the given single sample dataset, without CI formula and/or replicates of data.
I call it Impossible task, but that’s basically what statistics usually does.
So notion of saying something about a population parameter using only information from an observed sample is the crux of statistical inference, it is not limited to bootstrapping.

Rent in Manhattan

How much do you think it costs to rent a typical 1 bedroom apartment in Manhattan?

Data

Consider 20 one-bedroom apartments that were randomly selected on Craigslist Manhattan from apartments listed as “by owner”.

manhattan <- readr::read_csv("./data/manhattan.csv")
range(manhattan)

[1] 1470 4195

glimpse(manhattan)

Rows: 20
Columns: 1
$ rent <dbl> 3850, 3800, 2350, 3200, 2150, 3267, 2495, 2349, 3950, 1795, 2145,…

Parameter of Interest

Could focus on the median rent or mean rent depending on our research goal.

Observed Sample vs. Bootstrap Population

Sample median = $2350 😱

Population median = ❓

IDEA: We think the sample is representative of the population, so create an artificial population by replicating the subjects from the observed ones.

Bootstrap Population

Source: Figure 12.1 of Introduction to Modern Statistics

Bootstrap Sampling

Source: Figure 12.2 of Introduction to Modern Statistics

Practical Bootstrap Sampling

Source: Figure 12.4 of Introduction to Modern Statistics

Bootstrapping Algorithm

[1] Take a bootstrap sample
- a random sample taken with replacement from the original sample, of the same size as the original sample.
[2] Calculate the bootstrap statistic
- a statistic such as mean, median, proportion, slope, etc. computed on the bootstrap samples.
[3] Repeat steps (1) and (2) many times to create a bootstrap distribution
- a distribution of bootstrap statistics.
[4] Calculate the bounds of the $(1-\alpha)100\%$ confidence interval
- the middle $(1-\alpha)100\%$ of the bootstrap distribution.

R Lab infer 📦 in tidymodels

The objective of this package is to perform statistical inference using an expressive statistical grammar that coheres with the tidyverse framework.
https://infer.netlify.app/

So now we get the idea of BT. Let’s see how to do it in R.
You can write your own code to implement the BT.
But here, we use the infer package to do so. It is one of the tidymodels package.
The infer package performs statistical inference using an expressive statistical grammar that coheres with the tidyverse design framework.
So in R we have tidyverse for data wrangling and visualization, and we have tidymodels for inference and modeling.
The tidymodels framework is pretty young. It’s website was released last year. It is not very mature at this time, but I believe it will get more and more functionality and will be getting popular because its syntax is similar to tidyverse syntax.
The idea of using the infer package is basically shown in this workflow.
We start with specifying a hypothesis or response and predictors in general. In our case, we just specify the response variable of interest. Here our variable of interest is the rent.
And once the variable of interest is specified, we can start generating data and calculating the sample statistic we want. So in our case, we generate our data using BT, and we calculate the median.
And finally we can visualise the BT distribution and do the inference.

R Lab Generate Bootstrap Samples of Medians

specify() the variable of interest.

manhattan |>
    # specify the variable of interest
    specify(response = rent) #<<

R Lab Generate Bootstrap Samples of Medians

specify() the variable of interest.
generate() a fixed number of bootstrap samples.

manhattan |> 
    # specify the variable of interest
    specify(response = rent) |>
    # generate 15000 bootstrap samples
    generate(reps = 15000, type = "bootstrap")  #<<

R Lab Generate Bootstrap Samples of Medians

specify() the variable of interest.
generate() a fixed number of bootstrap samples.
calculate() the bootstrapped statistic(s).

manhattan |> 
    # specify the variable of interest
    specify(response = rent) |>
    # generate 15000 bootstrap samples
    generate(reps = 15000, type = "bootstrap") |>
    # calculate the median of each bootstrap sample
    calculate(stat = "median")  #<<

R Lab Generate Bootstrap Samples of Medians

specify() the variable of interest.
generate() a fixed number of bootstrap samples.
calculate() the bootstrapped statistic(s).
save bootstrapping distribution for analysis.

# save resulting bootstrap distribution
boot_sample <- manhattan |> 
    # specify the variable of interest
    specify(response = rent) |>  
    # generate 15000 bootstrap samples
    generate(reps = 15000, type = "bootstrap") |>  
    # calculate the median of each bootstrap sample
    calculate(stat = "median")

R Lab The bootstrap Sample

👤 How many observations are there in boot_sample? What does each observation represent?

dplyr::glimpse(boot_sample)

Rows: 15,000
Columns: 2
$ replicate <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ stat      <dbl> 2350, 2350, 2350, 2350, 2350, 2350, 2350, 2550, 2350, 2350, …

Each replicate median rent is calculated from a bootstrapped sample of size 20 from our original sample data.

R Lab Visualize the Bootstrap Distribution

bt_dist <- ggplot(data = boot_sample, mapping = aes(x = stat)) +
    geom_histogram(binwidth = 50) +
    labs(title = "Bootstrap distribution of medians") + theme_bw()
bt_dist

R Lab Calculate the CI

bt_ci <- infer::get_ci(
  boot_sample, level = .95)
bt_ci

# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    2162.     2875

Code

bt_dist + geom_vline(xintercept = c(bt_ci$lower_ci, bt_ci$upper_ci), 
                     color = "blue") +
    labs(subtitle = "and 95% CI") + theme_bw()

Bootstrapped CI for the Regression Slope

Quantify the Variability of the Slope

In simple linear regression, we

Bootstrap new samples $\{x^b_j, y^b_j\}_{j=1}^n$ from the original sample $\{x_i, y_i\}_{i=1}^n$.
Fit models to each of the bootstrapped samples and estimate the slope.
Use the distribution of the bootstrapped slopes to construct a confidence interval.

Bootstrap Sample 1

Bootstrap Sample 2

Bootstrap Sample 3

Bootstrap Sample 4

Bootstrap Sample 5

so on and so forth…

Bootstrap Samples 1 - 5

Bootstrap Samples 1 - 100

Slopes of bootstrap samples

Bootstrapped CI

A 95% confidence interval is bounded by the middle 95% of the bootstrap distribution.

R Lab Fit Regression to Bootstrap Samples

set.seed(2023)
boot_fits <- delivery_data |> 
  specify(time ~ cases) |> 
  generate(reps = 100, type = "bootstrap") |> 
  fit()
boot_fits

# A tibble: 200 × 3
# Groups:   replicate [100]
  replicate term      estimate
      <int> <chr>        <dbl>
1         1 intercept    0.738
2         1 cases        2.54 
3         2 intercept    2.24 
4         2 cases        2.33 
5         3 intercept    4.16 
6         3 cases        2.14 
# ℹ 194 more rows

R Lab Bootstrap CI

Percentile method: Compute the 95% CI as the middle 95% of the bootstrap distribution:

observed_fit <- delivery_data |> 
  specify(time ~ cases) |> 
  fit()
boot_fits |> get_ci(point_estimate = observed_fit, type = "percentile")

# A tibble: 2 × 3
  term      lower_ci upper_ci
  <chr>        <dbl>    <dbl>
1 cases        1.73      2.48
2 intercept    0.485     6.81

Standard error method: Compute the 95% CI as the point estimate $\pm$ ~2 standard deviations of the bootstrap distribution:

boot_fits |> get_ci(point_estimate = observed_fit, type = "se")

# A tibble: 2 × 3
  term      lower_ci upper_ci
  <chr>        <dbl>    <dbl>
1 cases        1.68      2.68
2 intercept   -0.341     6.98

R Lab `car::Boot()`

car::Boot() is a simple interface to the boot::boot().

lm_fit <- lm(time ~ cases, data = delivery_data)
car_boot <- car::Boot(lm_fit, R = 100)
car::brief(car_boot$t)

100 x 2 matrix (95 rows omitted)
       (Intercept) cases
  [1,]       1.159  2.40
  [2,]       3.537  2.11
  [3,]       0.961  2.35
. . .                        
 [99,]       2.757  2.24
[100,]       2.415  2.35

R Lab Histogram of Bootstrap Samples

BC$_a$ is the accelerated bias-corrected percentile interval.

hist(car_boot)

R Lab `car::Confint()`

car::Confint(lm_fit, vcov. = vcov(car_boot))

Standard errors computed by vcov(car_boot)

            Estimate  2.5 % 97.5 %
(Intercept)     3.32 -0.611   7.25
cases           2.18  1.613   2.74

R Lab `boot::boot()`

https://stats.oarc.ucla.edu/r/faq/how-can-i-generate-bootstrap-statistics-in-r/
ISL Lab 5.3.4

Simulation-based Inference 💻

Bootstrapping

Idea of Bootstrapping

Bootstrap Confidence Interval

Bootstrap Confidence Interval for Linear Regression

Randomization test for Linear Regression*

Why Simulation-based Inference?

Bootstrapping 👢

Rent in Manhattan

Data

Parameter of Interest

Observed Sample vs. Bootstrap Population

Bootstrap Population

Bootstrap Sampling

Practical Bootstrap Sampling

Bootstrapping Algorithm

R Lab infer 📦 in tidymodels

R Lab Generate Bootstrap Samples of Medians

R Lab Generate Bootstrap Samples of Medians

R Lab Generate Bootstrap Samples of Medians

R Lab Generate Bootstrap Samples of Medians

R Lab The bootstrap Sample

R Lab Visualize the Bootstrap Distribution

R Lab Calculate the CI

Bootstrapped CI for the Regression Slope

Quantify the Variability of the Slope

Bootstrap Sample 1

Bootstrap Sample 2

Bootstrap Sample 3

Bootstrap Sample 4

Bootstrap Sample 5

Bootstrap Samples 1 - 5

Bootstrap Samples 1 - 100

Slopes of bootstrap samples

Bootstrapped CI

R Lab Fit Regression to Bootstrap Samples

R Lab Bootstrap CI

R Lab car::Boot()

R Lab Histogram of Bootstrap Samples

R Lab car::Confint()

R Lab boot::boot()

R Lab `car::Boot()`

R Lab `car::Confint()`

R Lab `boot::boot()`