MATH 4780 / MSSC 5780 Regression Analysis
What are the assumptions in order to use \(z\) or \(t\) intervals?
The population is Gaussian. If not, the sample size is large enough so that the central limit theorem can be applied!
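As a reminder of what a \(t\) interval looks like when the Gaussian assumption holds, here is a minimal sketch in base R. The sample `x` is simulated and purely hypothetical, and the variable names are illustrative.

```r
# Hypothetical Gaussian sample (simulated, not course data)
set.seed(4780)
x <- rnorm(25, mean = 10, sd = 2)

n <- length(x)
alpha <- 0.05

# t interval: x_bar +/- t_{alpha/2, n-1} * s / sqrt(n)
t_crit <- qt(1 - alpha / 2, df = n - 1)
t_interval <- mean(x) + c(-1, 1) * t_crit * sd(x) / sqrt(n)
t_interval

# Cross-check against base R's t.test()
builtin_interval <- as.numeric(t.test(x, conf.level = 1 - alpha)$conf.int)
builtin_interval
```

The manual formula and `t.test()` should agree; both rely on the sampling distribution of the mean being (approximately) Gaussian.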
We never answered this question in MATH 4720: what if the population is not Gaussian and the sample size \(n\) is small, i.e., the CLT cannot be applied?
"Pulling oneself up by one’s bootstraps" is a metaphor for accomplishing an impossible task without any outside help.
How much do you think it costs to rent a typical 1 bedroom apartment in Manhattan?
Sample median = $2350 😱
Population median = ❓
IDEA: We think the sample is representative of the population, so we create an artificial population by replicating the observed subjects.
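Drawing from this artificial population amounts to resampling the observed data with replacement. A minimal base R sketch, assuming a simulated stand-in `rent_sample` (hypothetical values, not the Manhattan data) and an arbitrary number of resamples `n_boot`:

```r
# Hypothetical stand-in for the observed rents (simulated, NOT the course data)
set.seed(1)
rent_sample <- round(rlnorm(20, meanlog = log(2400), sdlog = 0.3))

n_boot <- 2000  # number of bootstrap samples (arbitrary choice for this sketch)

# Each bootstrap sample has the same size as the original sample and is
# drawn with replacement; we record the median of each one.
boot_medians <- replicate(n_boot, {
  resample <- sample(rent_sample, size = length(rent_sample), replace = TRUE)
  median(resample)
})

length(boot_medians)   # one median per bootstrap sample
summary(boot_medians)  # spread of the bootstrap medians
```

The spread of `boot_medians` approximates the sampling variability of the sample median without assuming a Gaussian population.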
The objective of the infer package is to perform statistical inference using an expressive statistical grammar that coheres with the tidyverse framework.
specify() the variable of interest.
manhattan |>
  # specify the variable of interest
  specify(response = rent)
specify() the variable of interest.
generate() a fixed number of bootstrap samples.
manhattan |>
  # specify the variable of interest
  specify(response = rent) |>
  # generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap")
specify() the variable of interest.
generate() a fixed number of bootstrap samples.
calculate() the bootstrapped statistic(s).
manhattan |>
  # specify the variable of interest
  specify(response = rent) |>
  # generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap") |>
  # calculate the median of each bootstrap sample
  calculate(stat = "median")
# save resulting bootstrap distribution
boot_sample <- manhattan |>
  # specify the variable of interest
  specify(response = rent) |>
  # generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap") |>
  # calculate the median of each bootstrap sample
  calculate(stat = "median")
👤 How many observations are there in boot_sample? What does each observation represent?
dplyr::glimpse(boot_sample)
Rows: 15,000
Columns: 2
$ replicate <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ stat <dbl> 2350, 2350, 2350, 2350, 2350, 2350, 2350, 2550, 2350, 2350, …
bt_dist <- ggplot(data = boot_sample, mapping = aes(x = stat)) +
  geom_histogram(binwidth = 50) +
  labs(title = "Bootstrap distribution of medians") +
  theme_bw()
bt_dist
In simple linear regression, we:
Bootstrap new samples \(\{x^b_j, y^b_j\}_{j=1}^n\) from the original sample \(\{x_i, y_i\}_{i=1}^n\).
Fit a model to each of the bootstrapped samples and estimate the slope.
Use the distribution of the bootstrapped slopes to construct a confidence interval.
so on and so forth…
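The three steps can be sketched by hand in base R. This is only an illustration under stated assumptions: it uses R's built-in `cars` data (`dist ~ speed`) rather than the delivery data, and `n_boot = 1000` is an arbitrary choice.

```r
set.seed(2023)

n <- nrow(cars)   # built-in dataset, used here only as a stand-in
n_boot <- 1000    # number of bootstrap samples (arbitrary for this sketch)

boot_slopes <- replicate(n_boot, {
  # Step 1: resample (x, y) pairs, i.e. whole rows, with replacement
  idx <- sample(n, size = n, replace = TRUE)
  # Step 2: refit the model on the bootstrap sample and keep the slope
  coef(lm(dist ~ speed, data = cars[idx, ]))[["speed"]]
})

# Step 3: use the middle 95% of the bootstrapped slopes as an interval
slope_ci <- quantile(boot_slopes, probs = c(0.025, 0.975))
slope_ci
```

Rows are resampled as pairs so that each \(y\) stays attached to its own \(x\).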
set.seed(2023)
boot_fits <- delivery_data |>
  specify(time ~ cases) |>
  generate(reps = 100, type = "bootstrap") |>
  fit()
boot_fits
# A tibble: 200 × 3
# Groups:   replicate [100]
  replicate term      estimate
      <int> <chr>        <dbl>
1         1 intercept    0.738
2         1 cases        2.54
3         2 intercept    2.24
4         2 cases        2.33
5         3 intercept    4.16
6         3 cases        2.14
# ℹ 194 more rows
observed_fit <- delivery_data |>
  specify(time ~ cases) |>
  fit()
boot_fits |>
  get_ci(point_estimate = observed_fit, type = "percentile")
# A tibble: 2 × 3
  term      lower_ci upper_ci
  <chr>        <dbl>    <dbl>
1 cases        1.73      2.48
2 intercept    0.485     6.81
boot_fits |> get_ci(point_estimate = observed_fit, type = "se")
# A tibble: 2 × 3
  term      lower_ci upper_ci
  <chr>        <dbl>    <dbl>
1 cases        1.68      2.68
2 intercept   -0.341     6.98
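To see why the two interval types differ, here is a rough base R sketch of each. The helpers `percentile_ci()` and `se_ci()` are hypothetical names written for this illustration. The use of a normal multiplier for the "se" interval is an assumption about how such an interval is usually formed, not a statement of infer's internals. The built-in `cars` data stands in for the delivery data.

```r
# Hypothetical helper: middle `level` share of the bootstrap statistics
percentile_ci <- function(boot_stats, level = 0.95) {
  alpha <- 1 - level
  unname(quantile(boot_stats, probs = c(alpha / 2, 1 - alpha / 2)))
}

# Hypothetical helper: point estimate +/- z * (bootstrap standard error).
# Assumes a normal multiplier; this is an assumption of the sketch.
se_ci <- function(boot_stats, point_estimate, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  point_estimate + c(-1, 1) * z * sd(boot_stats)
}

set.seed(2023)
observed_slope <- coef(lm(dist ~ speed, data = cars))[["speed"]]

# Bootstrap the slope by resampling rows with replacement
boot_slopes <- replicate(1000, {
  idx <- sample(nrow(cars), replace = TRUE)
  coef(lm(dist ~ speed, data = cars[idx, ]))[["speed"]]
})

ci_perc <- percentile_ci(boot_slopes)
ci_se <- se_ci(boot_slopes, observed_slope)
ci_perc
ci_se
```

The "se" interval is always symmetric about the observed estimate, while the percentile interval follows the shape of the bootstrap distribution, so the two can disagree when that distribution is skewed.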
car::Boot()
car::Boot() is a simple interface to boot::boot().
hist(car_boot)
car::Confint()
boot::boot()
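For comparison, a minimal sketch that calls boot::boot() directly. It assumes the built-in `cars` data as a stand-in, and `coef_stat` is a hypothetical name for the statistic function. boot::boot() expects a function of the data and a vector of resampled row indices.

```r
library(boot)

# Statistic function: refit the model on the resampled rows and
# return both coefficients (intercept, slope)
coef_stat <- function(data, indices) {
  coef(lm(dist ~ speed, data = data[indices, ]))
}

set.seed(2023)
boot_out <- boot(data = cars, statistic = coef_stat, R = 999)

# Percentile interval for the slope (index = 2 picks the second coefficient)
slope_ci <- boot.ci(boot_out, conf = 0.95, type = "perc", index = 2)
slope_ci$percent[4:5]  # lower and upper endpoints
```

`boot_out$t` holds one row of coefficients per bootstrap sample, and `boot_out$t0` holds the estimates from the original data.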