Assignment 2: Summarizing and Visualizing Data

Objectives

By the end of this assignment, you should:

understand how to read in data using read_csv
understand how to derive information from data (summarize, mutate, group_by)
understand how to make and modify a basic plot using ggplot

This assignment is due Thursday, September 23 at noon. Please turn in your .html AND .Rmd files on Canvas. Your .Rmd file should knit without error before you turn in the assignment.

To get started, you’ll need to make a .Rmd document. You can start from the template from the previous assignment and modify it as appropriate (including title, name, etc.). This assignment focuses on data from a recent paper examining the role that songs play in soothing infants (Bainbridge & Bertolo et al., 2021). Begin by reading the paper so that you broadly understand the question it is trying to answer and the methods the authors used (note that the methods are described in detail at the end of the paper). For all the questions requiring code, you should use tidyverse functions.

[a] What is the main question that the experiment is designed to address?
[b] What did the experimenters manipulate? (the independent variable)
[c] What are three variables the researchers measured? (the dependent variables)
[d] The authors consider an alternative hypothesis for their heart rate results. What is this hypothesis?

In this assignment, we’ll focus on the heart rate data. You can download a lightly cleaned version of their heart rate data here:

library(tidyverse)  # provides read_csv, the dplyr verbs, and ggplot

bb2021 <- read_csv("https://raw.githubusercontent.com/mllewis/cumulative-science/master/static/data/bb_2021_hr_clean.csv")

(If you’re curious, you can explore all their raw data in the repository associated with the paper.)

There are seven variables in the data and each variable is described below. The first six rows of the data frame are also displayed below.

participant_id - Unique identifier for each infant.
age - Age of the infant as a continuous variable, in months.
age_cat - Age of the infant as a discrete variable, in months.
trial_type - Trial type (lullaby vs. non-lullaby). They also had “preference” trials in the experiment. Those trials are not included in this dataset.
trial_id - Trial identifier. Note that the number of trials varies across participants. For some participants there are data for 6 trials, while for others there are data for only 4.
obs_num - For each trial, they measured heart rate roughly every .4 seconds. This variable tells you which observation in a trial you’re looking at (.4 seconds after the trial started would be coded as 1, .8 seconds after the trial started would be coded as 2, etc.).
zhr_pt - This is the heart rate at a given observation, normalized relative to the previous trial.

Is this data tidy? What is the unit of observation?

Let’s start by trying to understand the structure of the dataset. Calculate the following:
[a] The age of the youngest (minimum) child in the dataset.
[b] The age of the oldest (maximum) child in the dataset.
[c] The total number of observations represented in the data.

bb2021 %>%
  summarize(youngest = min(age),
            oldest = max(age),
            total = n())

Create a dataframe with only the first observation from the first trial for each participant. Use this dataframe to answer the following questions:
[a] How many participants are present in the dataset?
[b] How many 7-month-olds are there?
[c] Arrange the dataframe from youngest to oldest. What’s the participant_id for the youngest infant?
[d] Arrange the dataframe from oldest to youngest. What’s the participant_id for the oldest infant?

[a]

first_df <- bb2021 %>%
  filter(obs_num == 1, trial_id == 1) 

nrow(first_df)

[b]

first_df %>% 
  group_by(age_cat) %>% 
  summarize(n = n())

[c]

first_df %>% 
  arrange(age)

[d]

first_df %>% 
  arrange(desc(age))

What is the mean number of observations per trial? (hint: you’ll need to use both group_by and ungroup).

bb2021 %>%
  group_by(participant_id, trial_id) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  summarize(mean = mean(n))
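
To see why the hint mentions both group_by and ungroup, here is a minimal sketch on a made-up data frame. The `toy` tibble and its values are invented for illustration and are not taken from the real data. After the first summarize, the result is still grouped by participant_id, so averaging the counts at that point gives one mean per participant rather than a single overall mean.

```r
library(dplyr)

# Hypothetical toy data: 2 participants, varying numbers of observations per trial
toy <- tibble(
  participant_id = c("p1", "p1", "p1", "p1", "p1", "p2", "p2", "p2"),
  trial_id       = c(1, 1, 1, 2, 2, 1, 1, 1),
  zhr_pt         = c(0.1, 0.2, 0.0, -0.3, -0.1, 0.4, 0.5, 0.2)
)

# Step 1: one row per participant-trial, holding its observation count.
# "drop_last" is the default behavior, written out here to avoid the message.
per_trial <- toy %>%
  group_by(participant_id, trial_id) %>%
  summarize(n = n(), .groups = "drop_last")

# Still grouped by participant_id: this returns one mean per participant
by_participant <- per_trial %>%
  summarize(mean = mean(n))

# Step 2: ungroup first to get a single overall mean across all trials
overall <- per_trial %>%
  ungroup() %>%
  summarize(mean = mean(n))
```

In this toy example the three trials have 3, 2, and 3 observations, so `overall` has a single row while `by_participant` has two.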

How many observations are there in the lullaby condition and the non-lullaby condition?

bb2021 %>%
  group_by(trial_type) %>%
  summarize(n = n())

Next, let’s examine the dependent variable, heart rate. Create a new variable called hr_round that is the heart rate value rounded to the nearest hundredth (use the function round()).

bb2021 <- bb2021 %>%
  mutate(hr_round = round(zhr_pt, 2))

Plot a histogram of hr_round using geom_histogram. Be sure to add an appropriate title to your plot.

bb2021 %>%
  ggplot(aes(x = hr_round)) +
  geom_histogram() +
  ggtitle("Heart Rate Distribution")

Calculate the mean heart rate for each participant on each trial type. Save it to a new dataframe called participant_means.

participant_means <- bb2021 %>%
  group_by(participant_id, trial_type) %>%
  summarize(mean = mean(hr_round))

## `summarise()` has grouped output by 'participant_id'. You can override using the `.groups` argument.
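
The message above is informational, not an error. As a rough sketch of what it means, the example below uses an invented `toy` tibble (not the real data) to compare the default behavior with an explicit `.groups = "drop"`, which returns an ungrouped data frame and suppresses the message.

```r
library(dplyr)

# Hypothetical toy data standing in for bb2021
toy <- tibble(
  participant_id = c("p1", "p1", "p2", "p2"),
  trial_type     = c("lullaby", "non-lullaby", "lullaby", "non-lullaby"),
  hr_round       = c(-0.20, 0.10, -0.05, 0.30)
)

# Default: the last grouping variable is dropped, the rest are kept
default_means <- toy %>%
  group_by(participant_id, trial_type) %>%
  summarize(mean = mean(hr_round))

# Explicit: return a fully ungrouped data frame, with no message
dropped_means <- toy %>%
  group_by(participant_id, trial_type) %>%
  summarize(mean = mean(hr_round), .groups = "drop")

group_vars(default_means)  # "participant_id"
group_vars(dropped_means)  # character(0)
```

Either version works for this assignment; which grouping you keep only matters if you go on to summarize the result again.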

Use participant_means to create a violin plot showing the distribution of heart rates in the lullaby and non-lullaby conditions. Your plot should be a simplified version of Figure 2a in the paper with (a) two violins, (b) each violin a different color, and (c) points showing the underlying data. (hint: the order that you add geoms to your plot matters!).

ggplot(participant_means, aes(x = trial_type, y = mean, color = trial_type)) +
  geom_violin() +
  geom_point()
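
To illustrate the hint about geom order, here is a small sketch using invented participant means (`toy_means` is hypothetical). ggplot draws layers in the order they are added, so a filled violin added after the points can cover them. The sketch uses fill rather than color to make the effect visible.

```r
library(ggplot2)

# Hypothetical participant-level means, invented for illustration
toy_means <- data.frame(
  trial_type = rep(c("lullaby", "non-lullaby"), each = 5),
  mean       = c(-0.3, -0.1, 0.0, -0.2, 0.1, 0.2, 0.0, 0.3, 0.1, -0.1)
)

# Violin first, points second: points are drawn on top of the violins
points_on_top <- ggplot(toy_means, aes(x = trial_type, y = mean, fill = trial_type)) +
  geom_violin() +
  geom_point()

# Points first, violin second: the filled violins are drawn over the points
points_hidden <- ggplot(toy_means, aes(x = trial_type, y = mean, fill = trial_type)) +
  geom_point() +
  geom_violin()
```

Printing `points_on_top` and `points_hidden` side by side shows the difference.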

Use participant_means to calculate the overall means in the lullaby and non-lullaby conditions. Save this to a new dataframe called condition_means, and plot the two means as a bar plot of different colors. (hint: use geom_bar(stat = "identity")).

condition_means <- participant_means %>%
  group_by(trial_type) %>%
  summarize(mean = mean(mean))

ggplot(condition_means, aes(x = trial_type, y = mean, fill = trial_type))  +
  geom_bar(stat = "identity")
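
As a side note on the hint, geom_bar counts rows by default, which is why `stat = "identity"` is needed when the bar heights are already computed. The sketch below uses invented condition means (`toy_condition_means` is hypothetical) and also shows geom_col, which plots the values as given without the extra argument.

```r
library(ggplot2)

# Hypothetical condition means, invented for illustration
toy_condition_means <- data.frame(
  trial_type = c("lullaby", "non-lullaby"),
  mean       = c(-0.15, 0.10)
)

# Bar heights come straight from the `mean` column (no counting)
bar_identity <- ggplot(toy_condition_means,
                       aes(x = trial_type, y = mean, fill = trial_type)) +
  geom_bar(stat = "identity")

# geom_col() uses the identity stat by default
bar_col <- ggplot(toy_condition_means,
                  aes(x = trial_type, y = mean, fill = trial_type)) +
  geom_col()
```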

Create the plot below showing the mean heart rate by condition across trials. For extra credit, change the point size such that it corresponds to the number of trials represented.

trial_means <- bb2021 %>%
  group_by(trial_type, trial_id) %>%
  summarise(mean = mean(zhr_pt),
            n = n(), 
            .groups = "keep")  

ggplot(trial_means, aes(x = trial_id, y = mean, color = trial_type)) +
  geom_line() +
  geom_point() + #aes(size = n)
  ggtitle("Mean heart rate by trial number")

# Jonathan
bb2021 %>%
  group_by(trial_type, trial_id) %>%
  summarize(hr_means = mean(zhr_pt), num_trial = n_distinct(participant_id)) %>%
  ggplot(mapping = aes(x = trial_id, y = hr_means, color = trial_type, size = num_trial)) +
  geom_point() +
  geom_line(size = 0.5) +
  ggtitle("Mean heart rate by trial number") +
  ylab("mean")

# Nora
trial_means <- bb2021 %>%
  group_by(trial_id, trial_type) %>%
  summarize(mean_hr = mean(zhr_pt), total_trials = n_distinct(participant_id)) 

ggplot(trial_means, mapping = aes(x = trial_id, y = mean_hr, color = trial_type)) + 
  geom_point(mapping = aes(size = total_trials)) + 
  geom_line() + 
  ggtitle("Mean Heart Rate Change Per Trial #")

Choose your own geom! Look at the geoms on the ggplot cheatsheet and choose one to use with this data (other than geom_line, geom_point, geom_violin, and geom_bar). Make a beautiful, clear plot. Make sure to include a descriptive title.

# Emily (Visualizing the densities)
participant_means %>% 
  ggplot(mapping = aes(mean, color = trial_type)) +
  geom_freqpoly() +
  ggtitle("Frequency of Mean Heart Rates by Condition")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
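
The `stat_bin()` message above appears whenever geom_histogram or geom_freqpoly is used without specifying how to bin the data. A minimal sketch of setting the bin width explicitly, using invented values (`toy_means` and the width of 0.1 are assumptions for illustration, not recommendations for the real data):

```r
library(ggplot2)

# Hypothetical mean heart rate values, invented for illustration
toy_means <- data.frame(
  mean = c(-0.32, -0.21, -0.18, -0.05, 0.00, 0.04, 0.11, 0.15, 0.27, 0.33)
)

# Setting binwidth explicitly avoids the default of 30 bins and its message
hist_plot <- ggplot(toy_means, aes(x = mean)) +
  geom_histogram(binwidth = 0.1) +
  ggtitle("Toy histogram with binwidth = 0.1")

# layer_data() returns the binned counts that the plot will draw
binned <- layer_data(hist_plot)
```

The same `binwidth` argument can be passed to geom_freqpoly.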

# Iris (Comparison across ages)
bb2021_by_age <- bb2021 %>%
  group_by(age_cat, trial_type) %>%
  summarise(age_mean = mean(zhr_pt))

## `summarise()` has grouped output by 'age_cat'. You can override using the `.groups` argument.

ggplot(bb2021_by_age, mapping = aes(x = age_cat, y = age_mean, color = trial_type)) +
  geom_col(fill = "white") +
  ggtitle("Mean heart rate by age and trial type") +
  xlab("age (month)") +
  ylab("mean heart rate")

# Bethany (outlier customization)
participant_means %>% 
  ggplot(mapping = aes(x=trial_type, y=mean)) +
  geom_boxplot(outlier.colour = 'purple', outlier.size = 2, aes(fill=trial_type)) +
  scale_fill_brewer(palette = 'Set3', name='Trial Condition') +
  ggtitle(label='Average Participant Means by Trial Condition', subtitle = 'The mean heart rate across all trials for each participant based on trial condition') +
  xlab('Trial Condition') +
  ylab('Mean Heart Rate')

Modern Research Methods