AE 05: Simulation-Based Inference (HT Focus)

Bikeshare

Important

Open RStudio and create a subfolder in your AE folder called “AE-03a”.
Upload the ae-03a-sbi-ht.qmd and dcbikeshare.csv files into the folder you just created.

library(tidyverse)
library(ggformula)
library(broom)
library(kableExtra)

Data

Our data set contains daily rental counts from Capital Bikeshare in Washington, DC, for 2011 and 2012. It was obtained from the dcbikeshare data set in the dsbox R package.

We will focus on the following variables in the analysis:

count: Total daily bike rentals
temp_orig: Temperature in degrees Celsius

bikeshare <- read_csv("../data/dcbikeshare.csv") |>
  mutate(season = case_when(
    season == 1 ~ "winter",
    season == 2 ~ "spring",
    season == 3 ~ "summer",
    season == 4 ~ "fall"
  ),
  season = factor(season))

Rows: 731 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (16): instant, season, yr, mnth, holiday, weekday, workingday, weathers...
date  (1): dteday

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(bikeshare)

Rows: 731
Columns: 17
$ instant    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ dteday     <date> 2011-01-01, 2011-01-02, 2011-01-03, 2011-01-04, 2011-01-05…
$ season     <fct> winter, winter, winter, winter, winter, winter, winter, win…
$ yr         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ mnth       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ holiday    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
$ weekday    <dbl> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,…
$ workingday <dbl> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,…
$ weathersit <dbl> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2,…
$ temp       <dbl> 0.3441670, 0.3634780, 0.1963640, 0.2000000, 0.2269570, 0.20…
$ atemp      <dbl> 0.3636250, 0.3537390, 0.1894050, 0.2121220, 0.2292700, 0.23…
$ hum        <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0.518261,…
$ windspeed  <dbl> 0.1604460, 0.2485390, 0.2483090, 0.1602960, 0.1869000, 0.08…
$ casual     <dbl> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25, 38, 54…
$ registered <dbl> 654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768, 1280, 122…
$ count      <dbl> 985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1321, 126…
$ temp_orig  <dbl> 14.110847, 14.902598, 8.050924, 8.200000, 9.305237, 8.37826…

Exercises

In this activity, each group will be assigned a season to use:

Groups 1 and 5: winter
Groups 2 and 6: spring
Groups 3 and 7: summer
Groups 4 and 8: fall

Exercise 0

Filter the bikeshare data set so that it only contains observations from your assigned season. Make sure to give the new data set a different name.

Exercise 1

Conduct a little EDA. Generate a scatter plot of the number of bike rentals vs the temperature for your season by filling in the blanks in the code below. What do you think alpha does?

gf______(______, data = ______, alpha = 0.7) |>
  gf_labs(
    x = "Temperature (C)",
    y = "Number of Bike Rentals"
  )

Exercise 2

Fit a simple linear regression model, display the results to two decimal places, and be prepared to discuss the interpretation of the resulting slope and intercept.

model_fit <- lm(_____ ~ ______, data = _____)

tidy(_____)

Exercise 3

Write down your research question in words, then state the null and alternative hypotheses in both words and mathematical notation. You can use dollar signs to engage “math mode”.

\[ H_0: \]
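
As an illustration of the notation only: if the research question is whether temperature is linearly associated with the number of rentals, the hypotheses could be stated about the population slope. The symbol \beta_1 for the slope of temp_orig is an assumption, since this handout does not fix notation, and a one-sided alternative may fit your question better.

\[ H_0: \beta_1 = 0 \qquad H_A: \beta_1 \neq 0 \]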

Exercise 4

Fit a simple linear regression model, but use shuffle to randomize your response variable. Run the code below a few times and comment on how your answer changes.
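
A minimal sketch of one shuffled fit, under two assumptions: shuffle() comes from the mosaic package (which is not in the library list above), and toy_season is a small simulated data frame standing in for your filtered season data. Swap in your own data set name.

```r
library(mosaic)  # assumption: provides shuffle(); not loaded in the setup code above

# Hypothetical stand-in for one season of data (simulated, not the real rentals)
set.seed(210)
toy_season <- data.frame(temp_orig = runif(90, min = 0, max = 20))
toy_season$count <- 1000 + 150 * toy_season$temp_orig + rnorm(90, sd = 500)

# Observed fit, kept for comparison
obs_fit <- lm(count ~ temp_orig, data = toy_season)

# Permute the response so any real link between count and temp_orig is broken
shuffled_fit <- lm(shuffle(count) ~ temp_orig, data = toy_season)

coef(obs_fit)
coef(shuffled_fit)  # rerunning the shuffled fit gives a different slope each time
```

Because the shuffle pairs each temperature with a randomly chosen rental count, the shuffled slope reflects chance alone and should tend to fall near zero.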

Exercise 5

Generate the null distribution:

Fit a SLR using shuffle to randomize your response variable.
Use do to repeat the process 1000 times.
Save the result back into a data frame called null_dist.
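
The three steps above can be sketched as follows. This assumes shuffle() and do() come from the mosaic package (not in the library list above) and again uses a simulated, hypothetical toy_season data frame in place of your season data.

```r
library(mosaic)  # assumption: provides shuffle() and do()

# Hypothetical stand-in for one season of data (simulated, not the real rentals)
set.seed(210)
toy_season <- data.frame(temp_orig = runif(90, min = 0, max = 20))
toy_season$count <- 1000 + 150 * toy_season$temp_orig + rnorm(90, sd = 500)

# Refit the shuffled-response regression 1000 times; do() stores the
# coefficients from each refit as one row of a data frame
null_dist <- do(1000) * lm(shuffle(count) ~ temp_orig, data = toy_season)

head(null_dist)  # the shuffled slopes sit in the column named after the predictor
```

With this setup the shuffled slopes are in null_dist$temp_orig, so a histogram of that column (for example with gf_histogram(~ temp_orig, data = null_dist)) shows the null distribution.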

Exercise 6

Plot the null distribution. Visually, does your observed slope look unusual? What does this imply about your research question?

Exercise 7

Answer the following questions about the null distribution:

What does each row in null_dist represent?
What does each row in bikeshare or your data set (e.g. winter, summer, etc.) represent?
How are these different?
How is the number of rows in null_dist different from the sample size?
If I were to increase the number of shuffles, how would you expect the plot in the previous exercise to change?
If I were to increase the sample size (i.e., the number of days), how would you expect the null distribution to change?

Exercise 8

Fill in the code below to compute the p-value. Have your reporter write the value on the board.
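
One way the p-value computation might look, sketched for a two-sided alternative. The names toy_season and obs_slope are hypothetical, the data are simulated, and shuffle() and do() are assumed to come from the mosaic package; adapt the comparison if your alternative is one-sided.

```r
library(mosaic)  # assumption: provides shuffle() and do()

# Hypothetical stand-in for one season of data (simulated, not the real rentals)
set.seed(210)
toy_season <- data.frame(temp_orig = runif(90, min = 0, max = 20))
toy_season$count <- 1000 + 150 * toy_season$temp_orig + rnorm(90, sd = 500)

# Observed slope from the unshuffled data
obs_slope <- coef(lm(count ~ temp_orig, data = toy_season))[["temp_orig"]]

# Null distribution of slopes from 1000 shuffles of the response
null_dist <- do(1000) * lm(shuffle(count) ~ temp_orig, data = toy_season)

# Two-sided p-value: share of shuffled slopes at least as far from 0
# as the observed slope
p_value <- mean(abs(null_dist$temp_orig) >= abs(obs_slope))
p_value
```

The p-value here is just a proportion of rows in null_dist, so its resolution depends on the number of shuffles: with 1000 shuffles it can only change in steps of 0.001.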

Exercise 9

Do larger or smaller p-values provide evidence for the alternative hypothesis?
Do larger or smaller p-values provide evidence for your research question?
Interpret your p-value in the context of the problem. Do you think your data provides strong evidence for your research question?

Important

To submit the AE:

Render the document to produce the HTML with all of your work from today’s class.
The driver for your group should upload your .qmd and .html files to the Canvas assignment.