library(tidyverse)
library(ggformula)
library(broom)
library(openintro)
library(kableExtra)
AE 03: Simulation-Based Inference
Bikeshare
Data
Our dataset contains daily rentals from Capital Bikeshare in Washington, DC in 2011 and 2012. It was obtained from the dcbikeshare data set in the dsbox R package.
We will focus on the following variables in the analysis:
- count: total bike rentals
- temp_orig: temperature in degrees Celsius
bikeshare <- read_csv("../data/dcbikeshare.csv") |>
  mutate(
    season = case_when(
      season == 1 ~ "winter",
      season == 2 ~ "spring",
      season == 3 ~ "summer",
      season == 4 ~ "fall"
    ),
    season = factor(season)
  )
Rows: 731 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (16): instant, season, yr, mnth, holiday, weekday, workingday, weathers...
date (1): dteday
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(bikeshare)
Rows: 731
Columns: 17
$ instant <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ dteday <date> 2011-01-01, 2011-01-02, 2011-01-03, 2011-01-04, 2011-01-05…
$ season <fct> winter, winter, winter, winter, winter, winter, winter, win…
$ yr <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ mnth <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ holiday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
$ weekday <dbl> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,…
$ workingday <dbl> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,…
$ weathersit <dbl> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2,…
$ temp <dbl> 0.3441670, 0.3634780, 0.1963640, 0.2000000, 0.2269570, 0.20…
$ atemp <dbl> 0.3636250, 0.3537390, 0.1894050, 0.2121220, 0.2292700, 0.23…
$ hum <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0.518261,…
$ windspeed <dbl> 0.1604460, 0.2485390, 0.2483090, 0.1602960, 0.1869000, 0.08…
$ casual <dbl> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25, 38, 54…
$ registered <dbl> 654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768, 1280, 122…
$ count <dbl> 985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1321, 126…
$ temp_orig <dbl> 14.110847, 14.902598, 8.050924, 8.200000, 9.305237, 8.37826…
Exercises
In this activity, each group will be assigned a season to use:
- Group 1: winter
- Group 2: spring
- Group 3: summer
- Group 4: fall
Exercise 0
Filter the bikeshare data set so that it only contains observations from your assigned season. Make sure to give the new data set a different name.
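The filtering pattern can be sketched on R's built-in mtcars data, so the exercise itself stays open. The name four_cyl and the condition cyl == 4 are illustrative assumptions, not part of the bikeshare analysis.

```r
library(dplyr)

# Keep only the rows that match a condition and store the result under a
# new name, leaving the original data frame unchanged.
four_cyl <- mtcars |>
  filter(cyl == 4)

nrow(mtcars)    # rows in the original data
nrow(four_cyl)  # rows remaining after filtering
```

Saving the result under a new name means the full data set is still available if you need a different subset later.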
Exercise 1
Conduct a little EDA. Generate a scatter plot of the number of bike rentals vs. the temperature for your season by filling in the blanks in the code below. What do you think alpha does?
gf______(______, data = ______, alpha = 0.7) |>
gf_labs(
x = "Temperature (C)",
y = "Number of Bike Rentals",
)
Exercise 2
Fit a simple linear regression model, display the results to two decimal places, and be prepared to discuss the interpretation of the resulting slope and intercept.
model_fit <- lm(_____ ~ ______, data = _____)

tidy(_____) |>
  kable(digits = _____)
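As a rough guide to reading a slope and an intercept, the sketch below fits a line to the built-in mtcars data (an assumption made purely for illustration) and checks both interpretations against model predictions. The object names fit, b, and pred are hypothetical.

```r
# Fit a simple linear regression on the built-in mtcars data (illustrative only).
fit <- lm(mpg ~ wt, data = mtcars)
b <- coef(fit)  # b[1] is the intercept, b[2] is the slope

# Intercept: predicted mpg when wt = 0.
# Slope: predicted change in mpg for a one-unit increase in wt.
pred <- predict(fit, newdata = data.frame(wt = c(0, 3, 4)))

round(b, 2)
unname(pred[1])            # equals the intercept
unname(pred[3] - pred[2])  # equals the slope
```

Whether the intercept is meaningful depends on whether a predictor value of 0 makes sense in context, which is worth keeping in mind when interpreting your own model.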
Exercise 3
Load the infer package and calculate the observed fit (slope).
observed_fit <- ____ |>
  specify(count ~ temp_orig) |>
  fit()

observed_fit
Exercise 4
Take n bootstrap samples and fit a model to each one.
Fill in the code, then set eval: true.
n <- 100
set.seed(212)

boot_fits <- ______ |>
  specify(______) |>
  generate(reps = ____, type = "bootstrap") |>
  fit()

boot_fits
- Why do we set a seed before taking the bootstrap samples?
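To see what bootstrapping and seeding do under the hood, here is a minimal base R sketch on the built-in mtcars data. It is not the infer pipeline used in this activity, and the helper boot_slopes() and its arguments are hypothetical names introduced for illustration.

```r
# Bootstrap by hand on the built-in mtcars data (illustrative only).
boot_slopes <- function(data, reps, seed) {
  set.seed(seed)  # fixes the random number stream so the resamples are reproducible
  replicate(reps, {
    # Resample rows with replacement, keeping the original sample size
    rows <- sample(nrow(data), replace = TRUE)
    # Refit the model to the resampled rows and keep the slope
    coef(lm(mpg ~ wt, data = data[rows, ]))[["wt"]]
  })
}

run_a <- boot_slopes(mtcars, reps = 100, seed = 212)
run_b <- boot_slopes(mtcars, reps = 100, seed = 212)  # same seed
run_c <- boot_slopes(mtcars, reps = 100, seed = 213)  # different seed

identical(run_a, run_b)  # compare two runs that share a seed
identical(run_a, run_c)  # compare two runs with different seeds
```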
Exercise 5 (Challenging)
Make a histogram of the bootstrap samples to visualize the bootstrap distribution.
# Code for histogram
Exercise 6
Compute and interpret the 95% confidence interval as the middle 95% of the bootstrap distribution.
Fill in the code, then set eval: true.
get_confidence_interval(
  boot_fits,
  point_estimate = _____,
  level = ____,
  type = "percentile"
)
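The idea of "the middle 95% of the bootstrap distribution" can be sketched in base R with quantile(). This uses the built-in mtcars data and a hand-rolled bootstrap as assumptions for illustration. It is meant to mirror the idea behind a percentile interval, not to reproduce infer's implementation exactly.

```r
set.seed(212)

# Bootstrap slopes from the built-in mtcars data (illustrative only).
slopes <- replicate(1000, {
  rows <- sample(nrow(mtcars), replace = TRUE)
  coef(lm(mpg ~ wt, data = mtcars[rows, ]))[["wt"]]
})

# Middle 95%: cut off 2.5% of the bootstrap slopes in each tail.
level <- 0.95
alpha <- 1 - level
ci <- quantile(slopes, probs = c(alpha / 2, 1 - alpha / 2))
ci

# Share of bootstrap slopes that fall inside the interval
mean(slopes >= ci[1] & slopes <= ci[2])
```

Changing level changes how much of each tail is cut off, which in turn changes the width of the interval.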
Exercise 7
Modify the code from Exercise 6 to create a 90% confidence interval.
# Paste code for 90% confidence interval
Exercise 8
Modify the code from Exercise 6 to create a 99% confidence interval.
# Paste code for 99% confidence interval
Exercise 9
Which confidence level produces the most accurate confidence interval (90%, 95%, 99%)? Explain.
Which confidence level produces the most precise confidence interval (90%, 95%, 99%)? Explain.
If we want to be very certain that we capture the population parameter, should we use a wider or a narrower interval? What drawbacks are associated with using a wider interval?
Exercise 10
If your sample size \(n\) is increased, what impact do you think this will have on accuracy and precision?
What about if you increase the number of bootstrapped replicates?
Exercise 11
Your professor is interested in calculating the average amount of time CofI students spend doing homework.
If he collects a set of data and asks 100 students to compute 95% confidence intervals from that data, how many of those would you expect to contain the true average?
If, instead, he has each of those 100 students collect their own data and compute 95% confidence intervals from their own data, how many would you expect to contain the true average?
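One way to explore the second scenario is a small simulation. The sketch below invents a hypothetical population of weekly homework hours (the mean, standard deviation, and sample size are all assumptions) and uses a t-interval from t.test() rather than a bootstrap interval to keep the code short.

```r
set.seed(2024)

# Hypothetical population of weekly homework hours (all values are assumptions)
true_mean <- 10
true_sd <- 4
n_students <- 100  # each student collects their own sample
sample_size <- 30

covers <- replicate(n_students, {
  hours <- rnorm(sample_size, mean = true_mean, sd = true_sd)
  ci <- t.test(hours, conf.level = 0.95)$conf.int
  # TRUE if this student's interval contains the true mean
  ci[1] <= true_mean && true_mean <= ci[2]
})

sum(covers)  # number of the 100 intervals that contain the true mean
```

Rerunning with a different seed gives a slightly different count, which is worth comparing with your expectation. For the first scenario, consider what happens when every student starts from the same data.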
Exercise 12
Write down your research question in words then state the null and alternative hypotheses in both words and mathematical notation. You can use dollar signs to engage “math mode”.
\[ H_0: \]
Exercise 13
Generate the null distribution using permutation.
Fill in the code, then set eval: true.
n <- 100
set.seed(212)

null_dist <- _____ |>
  specify(______) |>
  hypothesize(null = "independence") |>
  generate(reps = _____, type = "permute") |>
  fit()
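What a permutation does can be sketched by hand in base R. The example uses the built-in mtcars data as an illustrative assumption and is not the infer pipeline from this exercise. It shows why shuffling the response simulates a world in which the null hypothesis of independence is true.

```r
set.seed(212)

# Permutation by hand on the built-in mtcars data (illustrative only).
null_slopes <- replicate(1000, {
  shuffled <- mtcars
  # Shuffling the response breaks any link between mpg and wt,
  # which is what the null hypothesis of independence asserts.
  shuffled$mpg <- sample(shuffled$mpg)
  coef(lm(mpg ~ wt, data = shuffled))[["wt"]]
})

observed_slope <- coef(lm(mpg ~ wt, data = mtcars))[["wt"]]

mean(null_slopes)   # center of the null slopes
range(null_slopes)  # spread of the null slopes
observed_slope      # compare with the null slopes above
```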
Exercise 14
Visualize the null distribution. Does your slope seem unusual if we assume the null hypothesis is the truth?
# Code for histogram of null distribution
Exercise 15
Fill in the code below to compute the p-value. Have your reporter write the value on the board.
# get observed fit
observed_fit <- _____ |>
  specify(____) |>
  fit()

# calculate p-value
get_p_value(
  ____,
  obs_stat = ____,
  direction = "two-sided"
)
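A two-sided p-value can be thought of as the share of null slopes at least as extreme as the observed slope. The base R sketch below illustrates that idea on the built-in mtcars data, using mpg vs. qsec as an assumed example. It uses a symmetric "distance from 0" definition, and get_p_value() may differ in detail.

```r
set.seed(212)

# Illustrative only: mpg vs. qsec from the built-in mtcars data.
observed_slope <- coef(lm(mpg ~ qsec, data = mtcars))[["qsec"]]

# Null distribution of slopes from shuffled responses
null_slopes <- replicate(2000, {
  shuffled <- transform(mtcars, mpg = sample(mpg))
  coef(lm(mpg ~ qsec, data = shuffled))[["qsec"]]
})

# Two-sided p-value: share of null slopes at least as far from 0
# as the observed slope, in either direction.
p_value <- mean(abs(null_slopes) >= abs(observed_slope))
p_value
```

Because the p-value is a proportion of simulated slopes, its resolution is limited by the number of permutations: with 2000 replicates the smallest nonzero value is 1/2000.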
Exercise 16
Do larger or smaller p-values provide evidence for the alternative hypothesis?
Do larger or smaller p-values provide evidence for your research question?
Interpret your p-value in the context of the problem. Do you think your data provide strong evidence for your research question?
To submit the AE:
Render the document to produce the HTML with all of your work from today’s class.
The driver for your group should upload your .qmd and .html files to the Canvas assignment.