AE 01: Multivariate Relationships

Important

For this exercise, you will work in groups, but everyone will work through the document and submit their solution.
Open RStudio and create a subfolder in your AE folder called “AE-01”
Go to the Canvas and locate your AE 01 assignment to get started.
Upload the ae-01.qmd files into the folder you just created. The .qmd and .html responses are due in Canvas no later than Monday, January 6th at 1:00pm.

Getting Started with R, RStudio, and Quarto

Quarto

The thing you are reading right now is an Quarto document (very similar to R Markdown if you’ve ever used that). Quarto runs inside R Studio. Quarto is a simple formatting syntax for authoring web pages, word documents, pdfs, and many more file types. You can find the link to a useful Quarto guide here.

When you click the Render button at the top of this window an html file will be generated that includes both content as well as the output of any embedded R code chunks within the document. Not only can you embed the R code, you can embed the output produced by the R code. In this way, your analysis is fully reproducible and updatable. All of your homework assignments must be prepared using Quarto.

The nice thing about Quarto is that you can write prose (as I am doing now), mathematical equations using LaTeX syntax (like \(y_i = a + b x_i\)), and R code/output/plots all in one synthesized document. This makes it approximately 10,000 times easier to use than doing the same thing in Word or LaTeX. As you go through this introduction, I recommend that you look also at the .qmd file and your output file side-by-side to get an idea of how Quarto works. (Since you’re going to have to use Quarto this semester, you might as well start learning it now!)

In the future, I will typically give you a Quarto template to fill out. However, occasionally you will need to create your own Quarto document. When you want to start a new Quarto document, click “File > New File > Quarto Document…” Put yourself as author, and make sure to give it a descriptive title!

Exercise 1

Click Render at the top of this document. Show Dr. Friedlander before moving on.

Using R as a Calculator and Running Code within a Quarto Document

With R by your side, you will never need your TI-84 again. Consider the simple arithmetic in the chunk below. You can run this code chunk all at once by clicking the Run button, the sideways green arrow in the top right of the chunk. Notice the interactive nature of the output. You can click through to see the different pieces of output.

As you work through this document, you should Run each chunk as you come to it.

Exercise 2

Run the code chunk below.

5 + 3

[1] 8

5.3 * 23.4

[1] 124.02

sqrt(16) # this is a comment.  R will not 'see' anything past a #, but you can use it to explain your code

[1] 4

Look closely at how Quarto denotes the R code and the output. Also note in the .qmd file how I include R code as separate from prose. These are called “chunks”. The easiest way to add a new chunk is to click on the green “c” icon with a plus in the corner above, then choose “R”. R code that is not inside of a chunk will not be run by Quarto!

Exercise 3

Create an R code chunk and compute the log of 10 using the function log. Why do you think the answer isn’t 1?

Exercise 4

You can also save values to named variables, to be used later:

product <- 15.3 * 23.4 #save result

If you save something like this, R will not show the output unless you expressly ask for it:

product #show result

[1] 358.02

The symbol “<-” is the assignment operator. If you’ve ever programmed before, it’s essentially the same as “=” in this instance. There are cases where “<-” works and “=” doesn’t, so it’s good to get into the habit of using “<-” now.

Once variables are defined, they can be referenced with other operators and functions. Try executing each line of code individually by placing your cursor on the first line of the chunk below and pressing Ctrl+Enter (Cmd +Enter for Mac users); then do the same for the second line. (This is how you can run a single line within a larger chunk.)

.5 * product # half of the product

[1] 179.01

product^(1/4) # fourth root of the product

[1] 4.349875

Exercise 5

You can also use in-line R code in Quarto, which can be useful when calling defined variables. Did you know that the natural log of 358.02 is 5.8805889?

Exercise 6

Only R code (and comments) should be inside chunks. Prose (interpretations/explanations/descriptions) should never be put inside a chunk; prose should be below or above the chunk, as I have done above (and continue to do throughout this document). You should also never cut-and-paste output or graphs into the chunks. The whole point of code chunks is that they contain the code and they’ll run the code (resulting in the output and/or graphs).

If you want to run something in R but don’t want it to appear in the Quarto document, simply run it in the Console in the lower-left quadrant of RStudio. Type the last line of R code above into the Console and see what happens.

Multivariate Relationships and Exploratory Data Analysis (EDA) with R

Carbohydrates and Starbucks

Starbucks often displays the total calories in their food items but not the other nutritional information. Carbohydrates are a body’s main fuel source. The Dietary Guidelines for America recommend that carbohydrates make up 45% to 65% of total daily calories. Our goal is to understand the relationship between the amount of carbohydrates and calories in Starbucks food items. We’d also like to assess if the relationship differs based on the type of food item (bakery, salad, sandwich, etc.)

You will now be introduced to the functions we will use most frequently for exploratory data analysis (EDA) in this class. If you are moderate to advanced R user, feel free to use whatever functions you’d like to accomplish the tasks. However, you should be prepared to answer questions about any uncited code which is not covered in this class on your oral exams. The first part of this document contains explanations for beginning R users, but all students should work through the entire document.

Important

In general, I suggest “rendering as you go”: rendering every few chunks, to make sure things are rendering correctly, rather than waiting until the end of a document to render the whole thing (and potentially encountering lots of errors that you have to unpack)! I also ALWAYS suggest “saving as you go”: it’s a good idea to save every couple of minutes. (This is good practice for all your files.)

Exercise 7

The code below loads the package called tidyverse which provides many functions that will be useful for manipulating data and the package called mosaic which is useful for teaching statistics. Add another line that loads a package called ggformula which we will use to plot our data and a third that loads the package openintro which will contain our data for this activity. Note that #| warning: FALSE causes the code to run without displaying any associated warnings when you render. You will often see #| eval: FALSE which prevents the code chunk from running when you render the document. Throughout this course, make sure you remove these lines or change them to #| eval: TRUE whenever you reach them. The reason that are included in the first place is so I can write scaffolded and fill-in-the-blank type exercises for you. These will cause errors (since they are incomplete code) until you finish writing them.

# load packages we typically use for this class.
library(tidyverse)
library(mosaic)
library(_________) # load ggformula
# load openintro

Loading data

Doing statistics requires data. Sometimes we will have to load data from a file and sometimes we can get our data from an R package. Today’s dataset, called starbucks is from the openintro package. To load a this data from within R you can use the data command.

Exercise 8

Use the code below to load the dataset starbucks, which is part of the openintro package. Don’t forget to change eval: FALSE to eval: TRUE… Why do you think we need eval: FALSE if you aren’t actually rewriting any code?

data("starbucks")

This is a great chance to remind you that R cares about letter case. This means that data("starbucks") works but data("Starbucks") doesn’t!

Inspecting the data source

Now you’re ready to learn a little bit about the starbucks data set.

Exercise 9

Run the code below and then edit the bulleted list below to add a short description in your own words describing what each function does. Hint: if it isn’t clear based on the output you can use ? before the function name in the console to bring up the documentation on that function. Notify Dr. Friedlander when you’re finished.

# Inspecting the data source
glimpse(starbucks)
head(starbucks)
names(starbucks)

glimpse() : this function…
head() : this function…
names() : this function…

Exercise 10

Based on the output above, how many observations and how many variables does this data set have?

Exercise 11

Let’s do a little bit of a data prep. In the console type ?starbucks and look at the definition of type.

The following is a little bit of data wrangling to get the source data in shape for our purposes. What do you think this code is doing? Make sure you change eval: FALSE to eval: TRUE.

starbucks <- starbucks |> 
  mutate(bakery = factor(if_else(type == "bakery", "bakery", "non-bakery")))

Statisticians should always know something about the data domain in order to be useful. Why do you think we are changing our data like this?

Exploratory Data Analysis

For the purposes of our class, it’s useful to learn a model-centric approach to R. The pseudo-code below is going to be our foundation for the rest of the class:

function( Y ~ X, data = DataSetName )

Here’s a short description of each part in the pseudo-code above:

function is an R function that dictates something you want to do with your data; for example,
- mean calculates the mean
- lm fits a linear regression model
Y is the outcome of interest (response variable)
X is some explanatory variable; you can use 1 as a placeholder if there is no explanatory variable
DataSetName is the name of a data set loaded into the R environment

Always start with clear research questions. Our question for this exercise:

How do the number of carbs compare for bakery and non-bakery items?

The purpose of the exploratory data analysis (EDA) is to learn as much as you can about your research question before doing any statistical modeling. We basically want to try and answer the research question with EDA if possible…or at least have a guess as to what the answer “should” be. Then we use statistical models to formally accommodate variability in the data and calculate the uncertainty of our conclusions.

Exercise 12

Use the R code chunk below to calculate the mean carbs by for bakery and non-bakery items. Summarize your observations below the code chunk.

mean(carb ~ bakery, data = starbucks)

Exercise 13

Of course, there are lots of other ways to summarize a numerical variable besides the mean. Use the R code chunk to calculate the other summary statistics for the number of carbs for bakery and non-bakery items using the function favstats. Summarize your observations below the code chunk.

(Hint: Do you not know how favstats works? Well, it’s a function just like any other: it follows the syntax described at the top of this section! Also, you can always find details and examples by searching the Help menu in the lower-right quadrant.)

Exercise 14

Let’s generate some univariate (i.e. one-variable) EDA plots for our three variables. In general, can use histograms or boxplots (gf_histogram or gf_boxplot) for quantitative variables and bar charts gf_bar for categorical variables. Fill in the blanks in the code below to generate the corresponding plots. Then create a boxplot of carb. Finally, share your observations below the code:

# make a histogram of carb
gf_histogram(~ carb, data = starbucks)

# make a histogram of calories
gf_histgoram(~ _____, data = ______)

# make a bar chart of bakery
gf_bar(~ _____, data = _____)

# make a boxplot of carb

Exercise 15

Let’s not generate some bivariate (i.e. two-variable) EDA plots for our three variables. For two-quantitative variables we typically use a scatter plot gf_point. If we are comparing a quantitative and a categorical variable will use side-by-side boxplots gf_boxplot. Generate and plot to visualize carb and calories and one to visualize carb and bakery. Comment on your observations.