Final Project Instructions

Overview

TL;DR: Pick a data set and do a regression analysis. That is your final project.

In this project, you will select a data set of interest to you, pose a research question that you will attempt to answer using multiple linear and/or multiple logistic regression, and write a paper in a formal scientific style. The data set can come from one of the pre-selected data sets made available on Canvas. However, if you have data from research you have conducted or another source which you are passionate about, you may use that data subject to instructor permission.
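As a rough illustration of the two model families named above, the sketch below fits one multiple linear regression and one multiple logistic regression. It uses R's built-in mtcars data purely as a stand-in. The variables and formulas are assumptions chosen for illustration, not a suggested analysis for any of the course data sets.

```r
# Stand-in data: mtcars ships with R; it is NOT one of the course data sets.
cars <- mtcars

# Multiple linear regression: quantitative response (mpg), two explanatory variables.
linear_fit <- lm(mpg ~ wt + hp, data = cars)

# Multiple logistic regression: binary response (am is coded 0/1), same explanatory variables.
logistic_fit <- glm(am ~ wt + hp, data = cars, family = binomial)

# Coefficient tables for each model.
summary(linear_fit)$coefficients
summary(logistic_fit)$coefficients
```

A quantitative response such as BMI would call for the first form, while a yes/no response such as whether someone drinks and drives would call for the second.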

Logistics

There are four major deliverables for the final project:

  1. A written, reproducible report detailing your analysis, submitted as either a Microsoft Word document or a PDF and uploaded to Canvas.
  2. The .qmd and .html files from a Quarto appendix you create.
  3. Formal peer review of another team’s work and presentation feedback.
  4. A PDF of a poster, uploaded to Canvas, which you will present on the last day of class.

However, due to the condensed nature of winter term, you will have a small deliverable due on most days of class to make sure no one gets left behind.

Utilizing the Librarians for Help with Research

From Christine Schutz:

An essential (and favorite) aspect of a librarian’s work is to provide assistance and coaching to students doing research for their projects and papers. Our two librarians - Christine Schutz and Lance McGrath – can help students to focus their source quests and locate appropriate and useful sources. Since students don’t always realize that not everyone who works in a library is a librarian (or even what that means) and that librarians aren’t in the library at all hours of the night (even though the library is open), please nudge them to reach out to us by email or Teams chat to set up a time to meet (or we can sometimes help over email/chat). From the history of the sufganiyot (this was last Fall’s favorite) to the Lamington crayfish (Euastacus sulcatus) and why they are blue (an all-time favorite of mine), student research projects are the best part of the job (and I know Lance feels the same way), please send those students our way.

You are more than welcome to contact me with any questions about your project, but please do not hesitate to reach out to Christine or Lance as well.

Timeline & Grading

Checkpoint I: Select a partner (0%)

  • due Monday, January 6

Checkpoint II: Select a data set and write a research question (1%)

  • due Tuesday, January 7

Checkpoint III: Literature Review (1%)

  • due Wednesday, January 8

Checkpoint IV: Annotated Bibliography (1%)

  • due Thursday, January 9

Checkpoint V: Project Proposal (2%)

  • due Friday, January 10

Checkpoint VI: Load and Clean the Data Set (1%)

  • due Monday, January 13

Checkpoint VII: Generate Summaries (1%)

  • due Tuesday, January 14

Checkpoint VIII: Analyze Plots and Tables (1%)

  • due Wednesday, January 15

Checkpoint IX: Exploratory Data Analysis (2%)

  • due Thursday, January 16

Checkpoint X: Poster Presentation (50%)

  • due in Canvas by Friday, January 24th, 11:59am

  • poster presentations will be on TVs in CML (Library) 105 from 1pm-3:30pm on Friday, January 24th

Checkpoint XI: Supplementary Report (40%)

  • due in Canvas by Friday, January 24th, 11:59pm

Checkpoint I: Select a partner

Your project will be completed with a partner. Try to select a partner whose interests are similar to yours. Anyone who has not selected a partner by the deadline will be assigned one.

Deliverables: Put yourselves in a group on Canvas.

Checkpoint II: Select a data set and write a research question

Select one of the pre-approved data sets and identify the research question which will guide your project (e.g. “Do youth who participate in physical exercise class have lower BMI?”, “Are males more likely to drink and drive after adjusting for confounding variables?”). Briefly describe why your chosen project is interesting to you.

Deliverables: Word document identifying which data set you plan to use, the response variable of interest, and a brief description of why your chosen project is interesting to you.

Checkpoint III: Literature Review

Meet with Christine Schutz and find articles in the refereed literature that are relevant to your question of interest. You must find at least six. You should avoid articles that are too technical to be relevant to the project or to be informative for the non-specialist. You must read the abstracts for each of these papers. Be sure you obtain the entire paper and not just an abstract! You will eventually use these references in the introduction of your paper.

Deliverables: Word document containing the citation for each reference (in a standard format) and a link, if appropriate. Note that you are REQUIRED to meet with Christine to receive credit.

Checkpoint IV: Annotated Bibliography

Based on the abstracts, choose two of the papers you listed in your literature review. For these two papers, write a few sentences summarizing the primary findings and how they relate to your research question.

Deliverables: Take the Word document you submitted for Checkpoint III and, below each of your two chosen articles, add a few sentences that summarize the primary findings and how they relate to your research question.

Checkpoint V: Project Proposal

Your proposal, to be turned in as a Word document or PDF on Canvas, will include the following. Note that some of the information below can be copied from previous Checkpoints:

  1. Identify the original data source. Include a brief summary of how, from whom, and by whom, the data were collected. Describe how the study design will impact the generalizability of any analysis.

  2. Identify the research question which will guide your project (e.g. “Do youth who participate in physical exercise class have lower BMI?”, “Are males more likely to drink and drive after adjusting for confounding variables?”).

  3. Provide a list of variables of interest and their definitions (including units). You should also include rationale for inclusion for each variable and identify the variable type, and whether it may need recoding. You should include at least 4 variables. A table is a good way to summarize this information; for example:

| Variable Name | Original Definition | Units | Range or Levels | Possible Recoding | Rationale |
|---|---|---|---|---|---|
| bmi | Body mass index | kg/m^2 | >0 | | Response variable |
| pe | How many days per week attends physical education class | Days/week | 0-5, integers | Currently categorical var. Recode to same values but numeric | Main explanatory var of interest |
| age | Age of student | Years | 12-19, integers | | Possible confounding var |
| lunch | Percent of students at the school receiving free/reduced lunch | % | 0-100 | | Possible confounding var (socio-econ status) |

  4. Your partially annotated bibliography. There must be at least six articles and at least two must be annotated.

Deliverables: Word document or PDF with the information above. You may receive points back on Checkpoints II-IV for your work on this assignment.

Checkpoint VI: Load and Clean the Data Set

We now begin the second stage of your project, the Exploratory Data Analysis portion.

You now need to load and “clean” your data set, make note of any problematic data and observations that need to be removed, determine whether you want to use the whole data set or a subset, and consider implications about any decisions you make about missing data. This website shares some simple approaches to missing data (and the relative advantages/disadvantages of each approach).

Deliverables: A QMD and HTML file uploaded to Canvas containing the following:

  • A chunk in which you load the data. See the Loading the Data section below. Note that you will be building on this document.
  • Any filtering that is necessary. At the very least this should include a treatment of your missing data.
    • A very brief discussion of whether your data are a representative sample of your target population.
  • The number of variables and observations in your resulting data set.
  • Univariate plots of all your variables
    • No need to describe them yet but note any observations which seem troublesome.
  • Bivariate plots of your response variable with all of your explanatory variables
    • No need to describe them yet but note any observations which seem troublesome.
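The cleaning steps listed above might look something like the following sketch. The data here are a made-up toy tibble whose column names mirror the example variable table in Checkpoint V. Dropping every row with a missing value (a complete-case analysis) is just one possible treatment of missing data, shown here as an assumption rather than a recommendation.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy data standing in for a real data set (values are invented for illustration).
students <- tibble(
  bmi   = c(21.4, 27.9, NA, 19.8, 24.3, 30.1),
  pe    = factor(c("0", "3", "5", "2", NA, "1")),  # stored as categorical
  age   = c(14, 16, 15, 17, 13, 18),
  lunch = c(35, 60, 42, 18, 77, 51)
)

cleaned <- students |>
  # Recode pe to the same values, but numeric (factor -> character -> number).
  mutate(pe = as.numeric(as.character(pe))) |>
  # Complete-case treatment of missing data: drop rows containing any NA.
  drop_na()

# Number of observations (rows) and variables (columns) in the resulting data set.
dim(cleaned)

# How many observations were lost to missing values.
nrow(students) - nrow(cleaned)

# One univariate plot: histogram of the response variable.
ggplot(cleaned, aes(x = bmi)) +
  geom_histogram(bins = 5)

# One bivariate plot: response variable against an explanatory variable.
ggplot(cleaned, aes(x = pe, y = bmi)) +
  geom_point()
```

Comparing the row counts before and after filtering is a quick way to see how much data a missing-data decision costs you.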

Loading the Data

National Financial Well-Being Survey: Simply download the csv from Canvas, upload it to Quarto, and read it using the read_csv function in the same way as we’ve been doing with activities and homework.

Behavioral Risk Factor Surveillance System (BRFSS): The data set is extremely large, which causes a few problems. It is too big for you to upload to RStudio yourself, so I have already uploaded it. Furthermore, if you keep the entire data set in memory, RStudio will run extremely slowly, so you should keep only the columns that you need. Use the code below to do so: simply replace col_1, col_2, etc. with the names of your columns. You may add as many columns as you need; simply separate the names with commas.

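library(tidyverse)  # provides read_csv() and select()
# Read the BRFSS file on the server, then keep only the columns you list.
read_csv("/srv/R/MAT212_WIN25/brfss2023.csv") |>
  select(col_1, col_2, col_3)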
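The code above reads the full file before select() drops the unused columns. As a possible variation, and assuming readr 2.0 or later, read_csv() also accepts a col_select argument that restricts which columns are kept at read time. The sketch below uses a small temporary file with made-up column names (col_1 through col_4) so that it runs anywhere. It is an illustration, not a replacement for the instructions above.

```r
library(readr)

# Hypothetical stand-in file so the sketch runs anywhere; on the course server
# you would point read_csv() at the BRFSS path instead.
tmp <- tempfile(fileext = ".csv")
write_csv(
  data.frame(col_1 = 1:3, col_2 = c("a", "b", "c"), col_3 = 4:6, col_4 = 7:9),
  tmp
)

# col_select asks read_csv() to keep only the named columns while reading.
small <- read_csv(tmp, col_select = c(col_1, col_3), show_col_types = FALSE)

# Only the requested columns are present in the result.
names(small)
```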

Checkpoint VII: Generate Plots and Tables

Generate univariate and bivariate summaries of your data.

Deliverables: A QMD and HTML file uploaded to Canvas. You should start with the same file you used for your last checkpoint and add the following:

  • Univariate summaries:
    • A plot for every variable you are considering
    • A table of every variable you are considering
  • Bivariate summaries:
    • A plot comparing your response variable to all of your explanatory variables
    • Use the function ggpairs from the package GGally to display a grid of all bivariate plots for your data. Feel free to use any resource available to you without citing it (e.g. ChatGPT, a tutor, Dr. F, etc.).
  • Note that you don’t need to analyze the plots yet… just create them.
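The summaries listed above could be produced along the lines of the sketch below. It uses a few columns of R's built-in mtcars data as a stand-in for your own data set, so the variable names are assumptions for illustration only. The ggpairs call assumes the GGally package is installed.

```r
library(dplyr)
library(ggplot2)
library(GGally)

# Stand-in data: a few columns of mtcars, with am treated as categorical.
cars <- mtcars |>
  select(mpg, wt, hp, am) |>
  mutate(am = factor(am, labels = c("automatic", "manual")))

# Univariate table for a quantitative variable.
mpg_summary <- cars |>
  summarize(n = n(), mean = mean(mpg), sd = sd(mpg), min = min(mpg), max = max(mpg))
mpg_summary

# Univariate table for a categorical variable.
am_counts <- cars |> count(am)
am_counts

# Univariate plot for one variable.
ggplot(cars, aes(x = mpg)) +
  geom_histogram(bins = 10)

# Grid of all pairwise (bivariate) plots for the data.
ggpairs(cars)
```

Quantitative variables are usually summarized with a center and spread, while categorical variables are summarized with counts, which is why the two tables above use different functions.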

Checkpoint VIII: Analyze Plots and Tables

Now that you’ve created all of your plots and tables, it’s time to start figuring out which ones are important and which ones you need. Each plot and table you have in your final EDA should have a purpose. That is, there should be something in that plot for you to talk about and point to. If not, then it’s not worth including. For this checkpoint I will have you go through all of your plots and tables, determine whether they are worth including, and summarize why you are including each one. In general, plots are better than tables for visualizing patterns, whereas tables are better if you want to reference specific numbers.

Deliverables: A QMD and HTML file uploaded to Canvas. You should start with the same file you used for your last checkpoint and add the following:

  • Under each plot and table:
    • Keep: Do you plan to keep this plot or table in your EDA? (If no, skip the rest.) The answer to most of these should be no.
    • Key insights: What important piece(s) of information are in this plot/table, and how do they relate back to your research question?
    • Redundancy: Is there information in the plot/table that is not evident in any of the other plots and tables? If the answer to this is no, why are you including it?
    • Medium: Why are you using a plot rather than a table (or a table rather than a plot) to display this information?

Checkpoint IX: Exploratory Data Analysis

You will now be synthesizing what you did in the previous three checkpoints into an exploratory data analysis report. You may receive points back on the previous three checkpoints for your work here. Note that you should not be fitting any models at this point.

Deliverables: Your EDA report, to be turned in on Canvas, will meet these guidelines:

In no more than 3 pages, summarize the main findings of your exploratory analysis, referring to specific plots and summary statistics where necessary. In addition, describe your plans for building models to address important research questions, including which variables will be important to consider in light of your exploratory analyses. This report should be meaningful and readable to someone familiar with statistics but unfamiliar with your particular research topic and dataset (i.e. your professor). Give concise but precise statements interpreting your plots, tables, summary statistics, etc. – in the context of your data set and research question you pose. Avoid vague terms like “this data”, “these results”, etc. Also avoid cryptic variable names that you may have used in R. A report like this might be something you’d share with collaborators or store as a reference as you proceed with your analysis.

  1. The Main Body of your EDA report should follow these guidelines:
    • No more than 3 pages
    • Be submitted as a Word document or PDF and include no code.
    • Begin with a short paragraph introducing your project and primary research questions. (This introduction will be expanded into several paragraphs for the final paper.)
    • Your next paragraph should be about your data. You should address where it’s from, how the study was conducted, and whether it contains a representative sample of the population you’re interested in.
    • Use your graphical and numerical summaries to tell a story, supporting your conclusions with plots, tables, and summary statistics. Weave numerical summaries and graphs seamlessly into your text.
    • You do not need to include EVERY plot or table you made in the report but include at least 2 interesting plots/tables (if not more!). Name each figure (e.g. Figure 1) so they are easily referred to in your report. These exploratory plots/tables don’t have to be perfect in terms of titles and labels, but for your final paper it is essential that your figures have (meaningful) captions and axis labels!
    • Preview directions you plan to go with modeling. What models will you begin by fitting, and what variables will be involved? This should be the last paragraph of the report.
    • Write well! Complete sentences, good flow, proper grammar, etc.
  2. Your EDA report should also include an Annotated Appendix and References section (not included in the 3 page limit) which include these elements:
    • Clear definitions of important variables and the (properly cited) source of the data.
    • Tables and figures that are informative but were not referenced specifically in the main report. Include a short annotation – one or two sentences on what they show.
    • A citation for each reference article (in a standard format) you included in your proposal. Also include a link, if appropriate.
  3. All of the code you used to generate the plots and graphs for this EDA should be included in your previous checkpoint. However, you are welcome to make changes. If you do, please re-upload the code and HTML file to Checkpoint VIII so I can reference it if I need to.

Checkpoint X: Poster Presentation

The following is from Passion-Driven Statistics:

You have conducted a quantitative research project. Now you will learn how to present your results as a research poster and presentation. Posters offer the opportunity to engage with an audience and to meaningfully disseminate your research findings to others.

Learn keys to a successful poster presentation. Consider your audience and frame your research question and results in an understandable and interesting way. Understand you should be brief, use large font size, and incorporate graphics instead of text whenever possible. See how being clear and concise with a logical layout will ensure the viewing experience is intellectually and aesthetically satisfying for your audience. Click HERE to watch the video lesson.

You will create a 3 column, 40” X 36” poster including an introduction, research questions, methods, results, and discussion.

Deliverables: Your poster, to be turned in on Canvas, will be a PDF. A rubric can be found here and a poster template can be found on Canvas.

Checkpoint XI: Supplementary Report

Your supplementary report should… wait for it… supplement your poster presentation.

Deliverables: Your Supplementary Report, to be turned in on Canvas, should be created in Canvas. A rubric can be found here.