Simulation-based Inference 💻

MATH 4780 / MSSC 5780 Regression Analysis

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

Bootstrapping

Idea of Bootstrapping

Bootstrap Confidence Interval

Bootstrap Confidence Interval for Linear Regression

Randomization test for Linear Regression*

Why Simulation-based Inference?

  • Let’s go back to MATH 4720. How do we do interval estimation for population mean \(\mu\)?
  • \(z\)-interval when \(\sigma\) is known
  • Student’s \(t\)-interval when \(\sigma\) is unknown

What are the assumptions in orderto use \(z\) or \(t\) intervals?

The population is Gaussian. If not, the sample size is large enough so that the central limit theorem can be applied!

We never answer this question in MATH 4720. What if the population is not Gaussian and the sample size \(n\) is small, i.e., CLT becomes powerless?

Bootstrapping 👢

  • The term bootstrapping comes from the phrase

pulling oneself up by one’s bootstraps

which is a metaphor for accomplishing an impossible task without any outside help.

  • Impossible task: estimating a population parameter using data from only the given single sample dataset, without CI formula and replicates of data.

Source: http://dailywhiteboard.blogspot.com/2014/04/day-251-pull-yourself-up-by-your.html

Rent in Manhattan

How much do you think it costs to rent a typical 1 bedroom apartment in Manhattan?

Data

  • Consider 20 one-bedroom apartments that were randomly selected on Craigslist Manhattan from apartments listed as “by owner”.
manhattan <- readr::read_csv("./data/manhattan.csv")
range(manhattan)
[1] 1470 4195
glimpse(manhattan)
Rows: 20
Columns: 1
$ rent <dbl> 3850, 3800, 2350, 3200, 2150, 3267, 2495, 2349, 3950, 1795, 2145,…

Parameter of Interest

  • Could focus on the median rent or mean rent depending on our research goal.

Observed Sample vs. Bootstrap Population

Sample median = $2350 😱

Population median = ❓

IDEA: We think the sample is representative of the population, so create an artificial population by replicating the subjects from the observed ones.

Bootstrap Population

Source: Figure 12.1 of Introduction to Modern Statistics

Bootstrap Sampling

Source: Figure 12.2 of Introduction to Modern Statistics

Practical Bootstrap Sampling

Source: Figure 12.4 of Introduction to Modern Statistics

Bootstrapping Algorithm

  • [1] Take a bootstrap sample
    • a random sample taken with replacement from the original sample, of the same size as the original sample.
  • [2] Calculate the bootstrap statistic
    • a statistic such as mean, median, proportion, slope, etc. computed on the bootstrap samples.
  • [3] Repeat steps (1) and (2) many times to create a bootstrap distribution
    • a distribution of bootstrap statistics.
  • [4] Calculate the bounds of the \((1-\alpha)100\%\) confidence interval
    • the middle \((1-\alpha)100\%\) of the bootstrap distribution.

R Lab infer 📦 in tidymodels

  • The objective of this package is to perform statistical inference using an expressive statistical grammar that coheres with the tidyverse framework.

  • https://infer.netlify.app/

R Lab Generate Bootstrap Samples of Medians

  • specify() the variable of interest.
manhattan |>
    # specify the variable of interest
    specify(response = rent) #<<

R Lab Generate Bootstrap Samples of Medians

  • specify() the variable of interest.
  • generate() a fixed number of bootstrap samples.
manhattan |> 
    # specify the variable of interest
    specify(response = rent) |>
    # generate 15000 bootstrap samples
    generate(reps = 15000, type = "bootstrap")  #<<

R Lab Generate Bootstrap Samples of Medians

  • specify() the variable of interest.
  • generate() a fixed number of bootstrap samples.
  • calculate() the bootstrapped statistic(s).
manhattan |> 
    # specify the variable of interest
    specify(response = rent) |>
    # generate 15000 bootstrap samples
    generate(reps = 15000, type = "bootstrap") |>
    # calculate the median of each bootstrap sample
    calculate(stat = "median")  #<<

R Lab Generate Bootstrap Samples of Medians

  • specify() the variable of interest.
  • generate() a fixed number of bootstrap samples.
  • calculate() the bootstrapped statistic(s).
  • save bootstrapping distribution for analysis.
# save resulting bootstrap distribution
boot_sample <- manhattan |> 
    # specify the variable of interest
    specify(response = rent) |>  
    # generate 15000 bootstrap samples
    generate(reps = 15000, type = "bootstrap") |>  
    # calculate the median of each bootstrap sample
    calculate(stat = "median")

R Lab The bootstrap Sample

👤 How many observations are there in boot_sample? What does each observation represent?

dplyr::glimpse(boot_sample)
Rows: 15,000
Columns: 2
$ replicate <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ stat      <dbl> 2350, 2350, 2350, 2350, 2350, 2350, 2350, 2550, 2350, 2350, …
  • Each replicate median rent is calculated from a bootstrapped sample of size 20 from our original sample data.

R Lab Visualize the Bootstrap Distribution

bt_dist <- ggplot(data = boot_sample, mapping = aes(x = stat)) +
    geom_histogram(binwidth = 50) +
    labs(title = "Bootstrap distribution of medians") + theme_bw()
bt_dist

R Lab Calculate the CI

bt_ci <- infer::get_ci(
  boot_sample, level = .95)
bt_ci
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    2162.     2875
Code
bt_dist + geom_vline(xintercept = c(bt_ci$lower_ci, bt_ci$upper_ci), 
                     color = "blue") +
    labs(subtitle = "and 95% CI") + theme_bw()

Bootstrapped CI for the Regression Slope

Quantify the Variability of the Slope

In simple linear regression, we

  • Bootstrap new samples \(\{x^b_j, y^b_j\}_{j=1}^n\) from the original sample \(\{x_i, y_i\}_{i=1}^n\).

  • Fit models to each of the bootstrapped samples and estimate the slope.

  • Use the distribution of the bootstrapped slopes to construct a confidence interval.

Bootstrap Sample 1

Bootstrap Sample 2

Bootstrap Sample 3

Bootstrap Sample 4

Bootstrap Sample 5

so on and so forth…

Bootstrap Samples 1 - 5

Bootstrap Samples 1 - 100

Slopes of bootstrap samples

Bootstrapped CI

  • A 95% confidence interval is bounded by the middle 95% of the bootstrap distribution.

R Lab Fit Regression to Bootstrap Samples

set.seed(2023)
boot_fits <- delivery_data |> 
  specify(time ~ cases) |> 
  generate(reps = 100, type = "bootstrap") |> 
  fit()
boot_fits
# A tibble: 200 × 3
# Groups:   replicate [100]
  replicate term      estimate
      <int> <chr>        <dbl>
1         1 intercept    0.738
2         1 cases        2.54 
3         2 intercept    2.24 
4         2 cases        2.33 
5         3 intercept    4.16 
6         3 cases        2.14 
# ℹ 194 more rows

R Lab Bootstrap CI

  • Percentile method: Compute the 95% CI as the middle 95% of the bootstrap distribution:
observed_fit <- delivery_data |> 
  specify(time ~ cases) |> 
  fit()
boot_fits |> get_ci(point_estimate = observed_fit, type = "percentile")
# A tibble: 2 × 3
  term      lower_ci upper_ci
  <chr>        <dbl>    <dbl>
1 cases        1.73      2.48
2 intercept    0.485     6.81
  • Standard error method: Compute the 95% CI as the point estimate \(\pm\) ~2 standard deviations of the bootstrap distribution:
boot_fits |> get_ci(point_estimate = observed_fit, type = "se")
# A tibble: 2 × 3
  term      lower_ci upper_ci
  <chr>        <dbl>    <dbl>
1 cases        1.68      2.68
2 intercept   -0.341     6.98

R Lab car::Boot()

lm_fit <- lm(time ~ cases, data = delivery_data)
car_boot <- car::Boot(lm_fit, R = 100)
car::brief(car_boot$t)
100 x 2 matrix (95 rows omitted)
       (Intercept) cases
  [1,]       1.159  2.40
  [2,]       3.537  2.11
  [3,]       0.961  2.35
. . .                        
 [99,]       2.757  2.24
[100,]       2.415  2.35

R Lab Histogram of Bootstrap Samples

  • BC\(_a\) is the accelerated bias-corrected percentile interval.
hist(car_boot)

R Lab car::Confint()

car::Confint(lm_fit, vcov. = vcov(car_boot))
Standard errors computed by vcov(car_boot) 
            Estimate  2.5 % 97.5 %
(Intercept)     3.32 -0.611   7.25
cases           2.18  1.613   2.74

R Lab boot::boot()