# load packages
library(tidyverse)  # for data wrangling
library(ggformula)  # for visualizing data
library(broom)      # for nicely displaying models
library(mosaic)     # for shuffling
library(scales)     # for pretty axis labels
library(coursekata) # highlighting middle of distributions
library(knitr)      # for neatly formatted tables
library(kableExtra) # also for neatly formatted tables

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 16))
Bootstrapped Confidence Intervals: Topics
Find range of plausible values for the slope using bootstrap confidence intervals
Exploratory data analysis
heb <- read_csv(here::here("data/HEBIncome.csv")) |>
  mutate(Avg_Income_K = Avg_Household_Income / 1000)

gf_point(Number_Organic ~ Avg_Income_K, data = heb, alpha = 0.7) |>
  gf_labs(
    x = "Average Household Income (in thousands)",
    y = "Number of Organic Vegetables"
  ) |>
  gf_refine(scale_x_continuous(labels = label_dollar()))
Modeling
heb_fit <- lm(Number_Organic ~ Avg_Income_K, data = heb)
tidy(heb_fit)
Intercept: HEBs in Zip Codes with an average household income of $0 are expected to have -14.72 organic vegetable options, on average.
Slope: For each additional $1,000 in average household income, we expect the number of organic options available at nearby HEBs to increase by 0.96, on average.
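To make the two interpretations concrete, here is a minimal sketch that plugs the rounded coefficients reported above into the fitted line. The function name `predict_organic` is a hypothetical helper introduced only for illustration.

```r
# Rounded coefficients as reported in the interpretations above
intercept <- -14.72
slope <- 0.96

# Hypothetical helper: predicted number of organic vegetables
# for an average household income given in thousands of dollars
predict_organic <- function(income_k) {
  intercept + slope * income_k
}

# Intercept: the prediction at an average income of $0
predict_organic(0)

# Slope: the change in the prediction for one additional $1,000
predict_organic(51) - predict_organic(50)
```

The negative prediction at $0 is a reminder that the intercept extrapolates the line far beyond any realistic income.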
From sample to population
For each additional $1,000 in average household income, we expect the number of organic options available at nearby HEBs to increase by 0.96, on average.
What is the goal of “statistical inference”?
What is the goal of a “hypothesis test”?
Confidence interval for the slope
Confidence interval
Confidence interval: plausible range of values for a population parameter
single point estimate \(\implies\) fishing in a murky lake with a spear
confidence interval \(\implies\) fishing with a net
We can throw a spear where we saw a fish, but we will probably miss. If we toss a net in that area, we have a good chance of catching the fish.
If we report a point estimate, we probably will not hit the exact population parameter, but if we report a range of plausible values, we have a good shot at capturing the parameter.
High confidence \(\implies\) wider interval (larger net)
Remember: single CI \(\implies\) either you hit the parameter or you don’t
A confidence interval will allow us to make a statement like “For each $1K in average income, the model predicts the number of organic vegetables available at local supermarkets to be higher, on average, by 0.96, plus or minus X options.”
Should X be 1? 2? 3?
If we were to take another sample of 37, would we expect the slope calculated from that sample to be exactly 0.96? Off by 1? 2? 3?
The answer depends on how variable (from one sample to another sample) the sample statistic (the slope) is
We need a way to quantify the variability of the sample statistic
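The idea that the slope changes from sample to sample can be seen in a small simulation. This is only a sketch: the population line, the noise level, and the income range below are made-up assumptions, not estimates from the HEB data, and it uses base R rather than the packages loaded at the top.

```r
set.seed(1)

# Hypothetical population values (assumptions for illustration only)
true_intercept <- -15
true_slope <- 1
noise_sd <- 20

# Draw one sample of n stores from the hypothetical population
# and return the slope fitted to that sample
sample_slope <- function(n = 37) {
  income_k <- runif(n, min = 30, max = 150)
  n_organic <- true_intercept + true_slope * income_k + rnorm(n, sd = noise_sd)
  unname(coef(lm(n_organic ~ income_k))[2])
}

# Slopes from 500 independent samples of size 37
slopes <- replicate(500, sample_slope())

range(slopes) # how far apart the sample slopes can land
sd(slopes)    # one number summarizing their sample-to-sample variability
```

In practice we only get one sample, which is why we need another way to estimate this variability.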
Quantify the variability of the slope for estimation
Two approaches:
Via simulation (what we’ll do today)
Via mathematical models (what we’ll do soon)
Bootstrapping to quantify the variability of the slope for the purpose of estimation:
Generate new samples by sampling with replacement from the original sample
Fit models to each of the new samples and estimate the slope
Use features of the distribution of the bootstrapped slopes to construct a confidence interval
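The resample-and-refit steps above can be sketched in a few lines. This is a minimal base R version run on a simulated stand-in for the sample (the data frame `original` and its columns are hypothetical); it is not necessarily how the course code does it, which may rely on the packages loaded earlier.

```r
set.seed(2)

# Hypothetical stand-in for the original sample of 37 stores (not the HEB data)
n <- 37
original <- data.frame(income_k = runif(n, min = 30, max = 150))
original$n_organic <- -15 + 1 * original$income_k + rnorm(n, sd = 20)

# One bootstrap replicate:
# resample rows with replacement, refit the model, return the slope
boot_slope <- function(data) {
  rows <- sample(nrow(data), replace = TRUE)
  unname(coef(lm(n_organic ~ income_k, data = data[rows, ]))[2])
}

# Repeat the resample-and-refit step 1,000 times
boot_slopes <- replicate(1000, boot_slope(original))

# The distribution of these slopes is what the interval is built from
summary(boot_slopes)
```

Each bootstrap sample has the same size as the original sample, so some stores appear more than once and others not at all.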
[Figures: Original Sample; Bootstrap samples 1, 2, 3, 4, and 5 shown individually; Bootstrap samples 1 - 5; Bootstrap samples 1 - 100]
Slopes of bootstrap samples
Fill in the blank: For each additional $1k in average household income, the model predicts the number of organic vegetables available to be higher, on average, by 0.96, plus or minus ___.
Confidence level
How confident are you that the true slope is between 0.8 and 1.2? How about 0.9 and 1.0? How about 1.0 and 1.4?
95% confidence interval
95% bootstrapped confidence interval: bounded by the middle 95% of the bootstrap distribution
We are 95% confident that for each additional $1K in average household income, the model predicts the number of organic vegetable options at local supermarkets to be higher, on average, by 0.81 to 1.31.
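"The middle 95% of the bootstrap distribution" translates to the 2.5th and 97.5th percentiles of the bootstrapped slopes. The sketch below shows that calculation on simulated data; `heb_sim` is a hypothetical stand-in, so the bounds it prints will not match the 0.81 to 1.31 interval reported above.

```r
set.seed(3)

# Hypothetical stand-in for the sample of 37 stores (not the HEB data)
n <- 37
heb_sim <- data.frame(income_k = runif(n, min = 30, max = 150))
heb_sim$n_organic <- -15 + heb_sim$income_k + rnorm(n, sd = 20)

# Bootstrap distribution of the slope: resample rows, refit, keep the slope
boot_slopes <- replicate(1000, {
  resampled <- heb_sim[sample(n, replace = TRUE), ]
  unname(coef(lm(n_organic ~ income_k, data = resampled))[2])
})

# Percentile interval: cut off 2.5% of the slopes in each tail
ci_95 <- quantile(boot_slopes, probs = c(0.025, 0.975))
ci_95
```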
Computing the CI for the slope I
Calculate the observed slope:
observed_fit <- lm(Number_Organic ~ Avg_Income_K, data = heb)
observed_fit
If we want to be very certain that we capture the population parameter, should we use a wider or a narrower interval? What drawbacks are associated with using a wider interval?
Precision vs. accuracy
How can we get the best of both worlds – high precision and high accuracy?
Changing confidence level
How would you modify the following code to calculate a 90% confidence interval? How would you modify it for a 99% confidence interval?
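One way to explore this question is a minimal sketch rather than the course's own code: with a percentile interval, the confidence level only changes how much of each tail is cut off. The helper `percentile_ci` and the simulated `boot_slopes` vector are hypothetical, introduced here for illustration.

```r
set.seed(4)

# Hypothetical stand-in for a vector of bootstrapped slopes
# (simulated here; in practice these come from refitting on resamples)
boot_slopes <- rnorm(1000, mean = 0.96, sd = 0.12)

# Hypothetical helper: percentile interval for a given confidence level
percentile_ci <- function(slopes, level = 0.95) {
  alpha <- 1 - level # total proportion left out in the two tails
  quantile(slopes, probs = c(alpha / 2, 1 - alpha / 2))
}

ci_90 <- percentile_ci(boot_slopes, level = 0.90) # 5th and 95th percentiles
ci_95 <- percentile_ci(boot_slopes, level = 0.95) # 2.5th and 97.5th percentiles
ci_99 <- percentile_ci(boot_slopes, level = 0.99) # 0.5th and 99.5th percentiles

# Interval widths grow with the confidence level
c(width_90 = unname(diff(ci_90)),
  width_95 = unname(diff(ci_95)),
  width_99 = unname(diff(ci_99)))
```

The widths illustrate the trade-off raised earlier: more confidence comes at the cost of a less precise interval.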