Assignment 3: More practice with the tidyverse

Objectives

The primary objective of this assignment is to give you more practice with:

using the basic dplyr functions (filter, arrange, select, summarize, mutate, group_by)
creating and modifying basic plots using ggplot

You should also:

understand how to select rows from a dataset using slice()
understand how to get unique rows using distinct()
understand how to facet plots with facet_wrap() and facet_grid()
understand how to join dataframes
understand how to convert between “wide” and “long” data structures

This assignment is due Thursday, September 30th at noon. Please turn your .html AND .Rmd files into Canvas. Your .Rmd file should knit without an error before turning in the assignment.

This assignment concerns a dataset from an experiment that tested whether 2-4 year-old children could learn new words from exclusion (Lewis, Cristiano, Lake, Kwan & Frank, 2020).

There were two conditions. In the critical condition, children saw two objects. One of the objects was an object that the child knew the label for (e.g., a ball) and the other object was an object that the child did not know the label for (e.g., tongs). The experimenter then asked the child to point to the novel object by saying, e.g., “Can you find the tongs?”. If the child assumes that each object only has one name, they should assume that this new label refers to the tongs, and not the ball. This phenomenon is called “Mutual Exclusivity” in the literature (Markman & Wachtel, 1988), because children are thought to assume that a new label is mutually exclusive with an old one. Let’s call this condition the “Novel-Familiar” condition, or NF.

In the control condition, children again saw two objects. This time both of the objects were objects that the child knew a label for (e.g., a ball and a cup). The experimenter then asked the child to point to one of the objects by saying, e.g., “Can you find the ball?”. Let’s call this condition the “Familiar-Familiar” condition, or FF.

Each child completed 7 trials: 4 in the NF condition and 3 in the FF condition. On each trial we recorded which object was the correct choice, and whether or not the child pointed to the correct object. We also measured two variables for each child: The age of the child and their performance on an separate vocabulary test.

Each variable in the dataset is described below:

sub_id - unique identifier for each participant in our dataset.
trial_num - trial number.
age_years - age of the child in years.
age_months - age of the child in months.
vocabulary_score - score (out of 100) on vocabulary test.
condition - NF or FF trial
target_object - correct object (e.g., ball, cup, tongs, etc)
correct - whether or not the child selected the correct object (TRUE/FALSE)

Here is the path to a lightly cleaned version of the dataset:

DATA_PATH <- "https://raw.githubusercontent.com/mllewis/cumulative-science/master/static/data/tidy_me_data.csv"

Load the data frame and save it to a variable called me_data. Use the glimpse() function to determine:
[a] how many observations there are in the data frame,
[b] the variable type of sub_id, and
[c] the variable type of target_object.

me_data <- read_csv(DATA_PATH)

## Rows: 1274 Columns: 8

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): condition, target_object
## dbl (5): sub_id, trial_num, age_years, age_months, vocabulary_score
## lgl (1): correct

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(me_data)

## Rows: 1,274
## Columns: 8
## $ sub_id           <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3,…
## $ trial_num        <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4,…
## $ age_years        <dbl> 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3,…
## $ age_months       <dbl> 47.04, 47.04, 47.04, 47.04, 47.04, 47.04, 47.04, 30.3…
## $ vocabulary_score <dbl> 72.73, 72.73, 72.73, 72.73, 72.73, 72.73, 72.73, 59.0…
## $ condition        <chr> "FF", "FF", "FF", "NF", "NF", "NF", "NF", "FF", "FF",…
## $ target_object    <chr> "balloon", "bowl", "apple", "helm", "beaker", "gourd"…
## $ correct          <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…

There are 1274 observations. sub_id is numeric/a double. target_objrect is a character/string.

[a] Use slice() to print rows 1 and 3 from me_data.
[b] Use arrange and slice() to print 7 rows of the first trial (where trial_num is 1).

me_data %>%
  arrange(trial_num) %>%
  slice(1,3)

## # A tibble: 2 × 8
##   sub_id trial_num age_years age_months vocabulary_score condition target_object
##    <dbl>     <dbl>     <dbl>      <dbl>            <dbl> <chr>     <chr>        
## 1      1         1         3       47.0             72.7 FF        balloon      
## 2      3         1         3       41.6             77.3 FF        cookie       
## # … with 1 more variable: correct <lgl>

me_data %>%
  arrange(trial_num) %>%
  slice(1:7)

## # A tibble: 7 × 8
##   sub_id trial_num age_years age_months vocabulary_score condition target_object
##    <dbl>     <dbl>     <dbl>      <dbl>            <dbl> <chr>     <chr>        
## 1      1         1         3       47.0             72.7 FF        balloon      
## 2      2         1         2       30.4             59.1 FF        cup          
## 3      3         1         3       41.6             77.3 FF        cookie       
## 4      4         1         3       41.5             72.7 FF        guitar       
## 5      5         1         3       42.4             68.2 FF        guitar       
## 6      6         1         2       32.3             63.6 FF        apple        
## 7      7         1         3       38.0             68.2 FF        apple        
## # … with 1 more variable: correct <lgl>

[a] How many children participated in our experiment?
[b] How many children participated in our experiment who were at least three-and-a-half years of age?

me_data %>%
  distinct(sub_id) %>%
  nrow()

## [1] 185

me_data %>%
  filter(age_months >= 42) %>%
  distinct(sub_id) %>%
  nrow()

## [1] 52

There are 185 children in the experiment. There are 52 children who are at least three and half years of age.

How many individual trials were there where the target object was “balloon”, “apple” or “guitar”?
[a] Use group_by to answer this question.
[b] Use count to answer this question.

me_data %>%
  filter(target_object %in% c("balloon", "apple", "guitar")) %>%
  group_by(target_object) %>%
  summarize(n = n())

## # A tibble: 3 × 2
##   target_object     n
##   <chr>         <int>
## 1 apple            49
## 2 balloon          44
## 3 guitar           45

me_data %>%
  filter(target_object %in% c("balloon", "apple", "guitar")) %>%
  count(target_object)

## # A tibble: 3 × 2
##   target_object     n
##   <chr>         <int>
## 1 apple            49
## 2 balloon          44
## 3 guitar           45

For each child, calculate the proportion of trials they got correct in each condition. Save it to a data frame called subject_means.

subject_means <- me_data %>%
  group_by(sub_id, condition) %>%
  summarize(prop_correct = sum(correct)/n())

## `summarise()` has grouped output by 'sub_id'. You can override using the `.groups` argument.

Use the subject_means data frame to calculate the mean proportion correct by condition. Plot the result as a bar plot. Include the following things:

a y-axis that scales from 0 to 1 (use ylim).
each condition as a different fill
an appropriate title
appropriate x- and y-axis labels
a red horizontal line indicating chance performance (use geom_hline(); geom_hline takes one parameter, yintercept).

Which condition are children better at?

overall_means <- subject_means %>%
  group_by(condition) %>%
  summarize(mean_correct = mean(prop_correct))

ggplot(overall_means, aes(x = condition, y = mean_correct, fill = condition)) +
  ggtitle("Mean correct selections by condition") +
   geom_bar(stat = "identity") +
  ylab("mean correct selections") +
  ylim(0,1) +
  geom_hline(yintercept = .5, color = "red")

Do children get better at the NF trials as they get older? Create a plot that shows mean performance at each age group (in years) on only NF trials.

subject_means_with_years <- me_data %>%
  group_by(sub_id, age_years, condition) %>%
  summarize(prop_correct = sum(correct)/n())

## `summarise()` has grouped output by 'sub_id', 'age_years'. You can override using the `.groups` argument.

overall_means_by_year <- subject_means_with_years %>%
  filter(condition == "NF") %>%
  group_by(age_years) %>%
  summarize(mean_prop_correct = mean(prop_correct))

ggplot(overall_means_by_year, aes(x = age_years, y = mean_prop_correct)) +
  geom_bar(stat = "identity") +
  ylab("Mean proportion correct NF trials") +
  ggtitle("Mean prop correct NF trials by age") +
  xlab("age (years)") +
  ylim(0,1) +
  geom_hline(yintercept = .5)

Make a version of the previous plot that shows performance on the NF trials at each age group for each target object. Use facet_wrap(). You’ll need to create a new data frame like subject_means_with_years but one that also includes the variable target_object. Call the new data frame subject_means_with_years_obj.

subject_means_with_years_obj <- me_data %>%
  group_by(sub_id, age_years, condition, target_object) %>%
  summarize(prop_correct = sum(correct)/n())

## `summarise()` has grouped output by 'sub_id', 'age_years', 'condition'. You can override using the `.groups` argument.

overall_means_by_year_obj <- subject_means_with_years_obj %>%
  filter(condition == "NF") %>%
  group_by(age_years, target_object) %>%
  summarize(mean_prop_correct = mean(prop_correct))

## `summarise()` has grouped output by 'age_years'. You can override using the `.groups` argument.

ggplot(overall_means_by_year_obj, aes(x = age_years, y = mean_prop_correct)) +
  geom_bar(stat = "identity") +
  ylim(0,1) +
  facet_wrap(~target_object) +
  geom_hline(yintercept = .5)

Using me_data, make a new variable called scaled_vocabulary_score that ranges from 0 to 1, rather than 0 to 100.

me_data <- me_data %>%
  mutate(scaled_vocabulary_score = vocabulary_score/100)

Use me_data to plot the distribution of children’s scaled_vocabulary_score. To do this, you’ll need a data frame with only one row per child. Use geom_histogram().

me_data %>%
  distinct(sub_id, .keep_all = TRUE) %>%
  ggplot(aes(x = scaled_vocabulary_score)) +
  geom_histogram(binwidth = .1)

Do older children have higher vocabularies? Recreate the plot below:

me_data %>%
  distinct(sub_id, .keep_all = TRUE) %>%
  ggplot(aes(x = age_months, y = vocabulary_score)) +
  geom_point() +
  geom_smooth(method = "lm") +
  xlab("Age (months)") +
  ylab("Vocabulary score") +
  ggtitle("Vocabulary vs. Age")

## `geom_smooth()` using formula 'y ~ x'

Recreate the plot below, where each point corresponds to an individual child.

subject_means_with_years_months <- me_data %>%
  filter(condition == "FF")  %>%
  group_by(sub_id, age_months, vocabulary_score) %>%
  summarize(prop_correct = sum(correct)/n())

## `summarise()` has grouped output by 'sub_id', 'age_months'. You can override using the `.groups` argument.

data_for_faceting <- pivot_longer(subject_means_with_years_months,cols = 2:3)

ggplot(data_for_faceting, aes(x=value, y = prop_correct)) + 
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~name, scales = "free") +
  labs(title = "Mean proportion correct on familiar-familiar trials",
       subtitle =  "Predicted by age (left) and vocabulary score (right)")

## `geom_smooth()` using formula 'y ~ x'

What other questions could we ask of this data?
[a] Pose an analytical question that could be answered with this data set.
[b] Make a clear, beautiful plot that helps answer this question. If appropriate, use multiple geoms in your plot. (use a geom other than geom_bar, geom_violin, geom_boxplot, geom_histogram)
[c] Interpret your plot.

Emily

Another analytical question that we could ask from this data set would be if as children age if they tend to get better at the FF or NF at the same rates. I’ll answer this question using the dataset and the plot below.

prop_correct_both_cond <- me_data %>% 
  group_by(sub_id, age_months, condition) %>% 
  summarise(prop_correct_both = mean(correct), .groups = 'keep')

prop_correct_both_cond %>% 
  ggplot(mapping = aes(x = age_months, y = prop_correct_both, color = condition)) +
  facet_wrap(~ condition) +
  geom_smooth(method = lm) +
  ggtitle("Mean proportion correct on trials in both NF and FF conditions at varying ages")

## `geom_smooth()` using formula 'y ~ x'

Based on the above plot with the smoothed linear line showing the trends in mena proportion correct at increasing ages (in months) for both the familiar-familiar condition and the novel-familiar condition, it appears that the NF condition starts at a lower base mean proportion correct in the youngest children, and increases more sharply over time than does the mean proportion correct in the FF condition.

Kelly

We could ask if the condition children perform better in changes over time.

me_data %>% 
  group_by(age_years, condition) %>% 
  mutate(prop_correct = mean(correct)) %>% 
  ggplot(aes(x = age_years,
             y = prop_correct,
             color = condition)) + 
  geom_line() +
  geom_point() + 
  labs(x = "Age",
       y = "Proportion of Trials Correct",
       title = "Change in Performance of NF vs. FF Condition over Time") +
  theme_classic()

It appears that while children perform much better in FF conditions at younger ages as they develop, they get better at the NF condition to the point that success rate is about even at 4 years of age. It would be interesting to see how the performance rate continues to change as children age more.

Raina

Analytical question: Is there a linear relationship between the proportion of NF trials answered correctly and the proportion of FF trials answered correctly by the children?

data_1 <- me_data %>% 
  group_by(sub_id, condition) %>% 
  filter(correct == TRUE) %>% 
  summarise(num_correct = n()) %>% 
  ungroup()

## `summarise()` has grouped output by 'sub_id'. You can override using the `.groups` argument.

data_2 <- me_data %>% 
  group_by(sub_id, condition) %>% 
  summarise(tot = n())

## `summarise()` has grouped output by 'sub_id'. You can override using the `.groups` argument.

full_data <- full_join(data_1, data_2)

## Joining, by = c("sub_id", "condition")

subject_means <- full_data %>% 
  mutate(prop = num_correct/tot)
ff_data <- subject_means %>% 
  filter(condition == 'FF') %>% 
  select(-c('num_correct','tot'))

nf_data <- subject_means %>% 
  filter(condition == 'NF') %>% 
  select(-c('num_correct','tot'))

wide_ff_data <- ff_data %>% 
  pivot_wider(names_from = condition,
              values_from = prop)

wide_nf_data <- nf_data %>% 
  pivot_wider(names_from = condition,
              values_from = prop)

wide_full_data <- full_join(wide_ff_data, wide_nf_data)

## Joining, by = "sub_id"

wide_full_data <- na.omit(wide_full_data)
ggplot(wide_full_data, aes(x = FF, y = NF)) +
  geom_point(position = 'jitter') +
  geom_smooth(method = 'lm') +
  labs(title = 'Relationship between proportion of trials answered correctly 
in the NF and FF conditions',
       x = 'Proportion of FF trials answered correctly',
       y = 'Proportion of NF trials answered correctly')

## `geom_smooth()` using formula 'y ~ x'

Interpretation of plot: As seen in the above plot, there are several children (each child is represented by a point in the plot) that either have a low X-value (low proportion of FF trials answered correctly) but have a high Y-value (high proportion of NF trials answered correctly). Similarly, there are some children who have a high X-value and a low Y-value. Hence, there does not seem to be a clear linear relationship between the proportion of FF and NF trials answered correctly by the children.

Assignment 3: More practice with the tidyverse - SOLUTIONS

Modern Research Methods

Emily

Kelly

Raina