AE 06: Simulation-Based Inference (Bootstrap CI)

Bikeshare

Important

Open RStudio and create a subfolder in your AE folder called “AE-03”
Go to the Canvas and locate your AE 03 assignment to get started.
Upload the ae-03-sbi.qmd and dcbikeshare.csv files into the folder you just created.

library(tidyverse)
library(ggformula)
library(broom)

Data

Our dataset contains daily rentals from the Capital Bikeshare in Washington, DC in 2011 and 2012. It was obtained from the dcbikeshare data set in the dsbox R package.

We will focus on the following variables in the analysis:

count: total bike rentals
temp_orig: Temperature in degrees Celsius

bikeshare <- read_csv("../data/dcbikeshare.csv") |>
  mutate(season = case_when(
    season == 1 ~ "winter",
    season == 2 ~ "spring",
    season == 3 ~ "summer",
    season == 4 ~ "fall"
  ),
  season = factor(season))

Rows: 731 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (16): instant, season, yr, mnth, holiday, weekday, workingday, weathers...
date  (1): dteday

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(bikeshare)

Rows: 731
Columns: 17
$ instant    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ dteday     <date> 2011-01-01, 2011-01-02, 2011-01-03, 2011-01-04, 2011-01-05…
$ season     <fct> winter, winter, winter, winter, winter, winter, winter, win…
$ yr         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ mnth       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ holiday    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
$ weekday    <dbl> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,…
$ workingday <dbl> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,…
$ weathersit <dbl> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2,…
$ temp       <dbl> 0.3441670, 0.3634780, 0.1963640, 0.2000000, 0.2269570, 0.20…
$ atemp      <dbl> 0.3636250, 0.3537390, 0.1894050, 0.2121220, 0.2292700, 0.23…
$ hum        <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0.518261,…
$ windspeed  <dbl> 0.1604460, 0.2485390, 0.2483090, 0.1602960, 0.1869000, 0.08…
$ casual     <dbl> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25, 38, 54…
$ registered <dbl> 654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768, 1280, 122…
$ count      <dbl> 985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1321, 126…
$ temp_orig  <dbl> 14.110847, 14.902598, 8.050924, 8.200000, 9.305237, 8.37826…

Exercises

In this activity, each group will be assigned a season to use:

Groups 1 and 5: winter
Groups 2 and 6: spring
Groups 3 and 7: summer
Groups 4 and 8: fall

Exercise 0

Filter the bikeshare data set so that it only contains observations from your assigned season. Make sure to give the new data set a different name.

Exercise 1

Conduct a little EDA. Generate a scatter plot of the number of bike rentals vs the temperature.

Exercise 2

Fit a simple linear regression model, display the results. What is your “observed slope”.

Exercise 3

Load the mosaic package at the beginning of this document where the other packages are loaded. In general, it is good practice to keep all your packages at the beginning of your analyses.

Set a seed, take 1000 bootstrap samples using resample, and fit linear models to each one. This is called the sampling distribution. Where do you think the center of this distribution will be? How will it be different than the center of the null distribution?

Why do we set a seed before taking the bootstrap samples?

Exercise 4

Make a histogram of the bootstrap samples to visualize the bootstrap distribution. Optionally, highlight there area of the histogram corresponding to a 95% confidence interval. To do so, please load the coursekata packages.

# Code for histogram

Exercise 5

Discuss the following with your group and be prepared to share your thoughts with the class:

How is this process of resampling different than the process of shuffling we did for hypothesis testing?
Thinking about the goals of a hypothesis test vs the goals of a confidence interval, why are we using these two different techniques?

Exercise 6

Compute and interpret the 95% confidence interval as the middle 95% of the bootstrap distribution.

Exercise 7

Modify the code from Exercise 6 to create a 90% confidence interval. Before you run your code, do you expect the interval to get narrower or wider?

Exercise 8

Modify the code from Exercise 6 to create a 99% confidence interval. Before you run your code, do you expect the interval to get narrower or wider?

# Paste code for 90% confidence interval

Exercise 9

Which confidence level produces the most accurate confidence interval (90%, 95%, 99%)? Explain
Which confidence level produces the most precise confidence interval (90%, 95%, 99%)? Explain
If we want to be very certain that we capture the population parameter, should we use a wider or a narrower interval? What drawbacks are associated with using a wider interval?

Exercise 10

If your sample size \(n\) is increased, what impact do you think this will have on accuracy and precision?
What about if you increase the number of bootstrapped replicates?

Exercise 11

Your professor is interested in calculating the average amount of time CofI students spend doing homework.

If he collects a set of data and asks 100 students to compute 95% confidence intervals from that data, how many of those would you expect to contain the true average?
If, instead, he has each of those 100 students collect their own data and compute 95% confidence intervals from their own data, how many would you expect to contain the true average?