Assignment 3: More practice with the tidyverse

Objectives

The primary objective of this assignment is to give you more practice with:

using the basic dplyr functions (filter, arrange, select, summarize, mutate, group_by)
creating and modifying basic plots using ggplot

You should also:

understand how to select rows from a dataset using slice()
understand how to get unique rows using distinct()
understand how to facet plots with facet_wrap() and facet_grid()
understand how to join dataframes
understand how to convert between “wide” and “long” data structures
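
As a refresher on the last two objectives, here is a minimal sketch of reshaping and joining on made-up data. The tibbles and column names (wide_scores, child_ages, prop_correct) are hypothetical and are not part of the assignment dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per child, one column per condition ("wide").
wide_scores <- tibble(
  child = c("a", "b"),
  NF = c(0.75, 0.50),
  FF = c(1.00, 0.67)
)

# Wide -> long: one row per child per condition.
long_scores <- wide_scores %>%
  pivot_longer(cols = c(NF, FF), names_to = "condition", values_to = "prop_correct")

# Long -> wide: back to one column per condition.
wide_again <- long_scores %>%
  pivot_wider(names_from = condition, values_from = prop_correct)

# A second hypothetical table keyed by the same child identifier.
child_ages <- tibble(child = c("a", "b"), age_years = c(2, 4))

# left_join() keeps every row of long_scores and adds the matching columns.
joined <- left_join(long_scores, child_ages, by = "child")
```

The long form is usually the one ggplot expects, since each aesthetic maps to a single column.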

This assignment is due Thursday, September 30th at noon. Please turn in your .html AND .Rmd files on Canvas. Your .Rmd file should knit without errors before you turn in the assignment.

This assignment concerns a dataset from an experiment that tested whether 2- to 4-year-old children could learn new words from exclusion (Lewis, Cristiano, Lake, Kwan & Frank, 2020).

There were two conditions. In the critical condition, children saw two objects. One of the objects was an object that the child knew the label for (e.g., a ball) and the other object was an object that the child did not know the label for (e.g., tongs). The experimenter then asked the child to point to the novel object by saying, e.g., “Can you find the tongs?”. If the child assumes that each object only has one name, they should assume that this new label refers to the tongs, and not the ball. This phenomenon is called “Mutual Exclusivity” in the literature (Markman & Wachtel, 1988), because children are thought to assume that a new label is mutually exclusive with an old one. Let’s call this condition the “Novel-Familiar” condition, or NF.

In the control condition, children again saw two objects. This time both of the objects were objects that the child knew a label for (e.g., a ball and a cup). The experimenter then asked the child to point to one of the objects by saying, e.g., “Can you find the ball?”. Let’s call this condition the “Familiar-Familiar” condition, or FF.

Each child completed 7 trials: 4 in the NF condition and 3 in the FF condition. On each trial we recorded which object was the correct choice, and whether or not the child pointed to the correct object. We also measured two variables for each child: the age of the child and their performance on a separate vocabulary test.

Each variable in the dataset is described below:

sub_id - unique identifier for each participant in our dataset.
trial_num - trial number.
age_years - age of the child in years.
age_months - age of the child in months.
vocabulary_score - score (out of 100) on vocabulary test.
condition - trial condition (NF or FF).
target_object - correct object (e.g., ball, cup, tongs).
correct - whether or not the child selected the correct object (TRUE/FALSE).

Here is the path to a lightly cleaned version of the dataset:

DATA_PATH <- "https://raw.githubusercontent.com/mllewis/cumulative-science/master/static/data/tidy_me_data.csv"

Load the data frame and save it to a variable called me_data. Use the glimpse() function to determine:
[a] how many observations there are in the data frame,
[b] the variable type of sub_id, and
[c] the variable type of target_object.
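
For reference, the general read-then-inspect pattern might look like the sketch below. It uses a small stand-in file written to a temporary path so that it runs offline. The file contents and the names toy_path and toy_data are hypothetical.

```r
library(readr)
library(dplyr)

# Hypothetical stand-in file so the example does not need a network connection.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("id,item,score", "1,ball,10", "2,cup,20", "3,ball,30"), toy_path)

# read_csv() guesses a type for each column from the file contents.
toy_data <- read_csv(toy_path)

# glimpse() prints the number of rows and columns, plus each column's type.
glimpse(toy_data)
```

read_csv() accepts a URL in place of a local path, so the same call shape should work with DATA_PATH.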

[a] Use slice() to print rows 1 and 3 from me_data.
[b] Use arrange() and slice() to print 7 rows from the first trial (where trial_num is 1).
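
As a reminder of how slice() selects rows by position, here is a minimal sketch on a made-up tibble. The columns step and label are hypothetical.

```r
library(dplyr)

toy <- tibble(step = c(2, 1, 3, 1), label = c("w", "x", "y", "z"))

# slice() picks rows by position: here the 1st and 3rd rows.
rows_1_and_3 <- toy %>% slice(1, 3)

# arrange() sorts first, so slice() then takes rows from the top of the sorted table.
two_smallest <- toy %>%
  arrange(step) %>%
  slice(1:2)
```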

[a] How many children participated in our experiment?
[b] How many children participated in our experiment who were at least three-and-a-half years of age?

How many individual trials were there where the target object was “balloon”, “apple” or “guitar”?
[a] Use group_by to answer this question.
[b] Use count to answer this question.
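
The two approaches are closely related: count() is roughly shorthand for group_by() followed by summarize(n = n()). A minimal sketch on made-up data (the fruit column is hypothetical):

```r
library(dplyr)

toy <- tibble(fruit = c("apple", "pear", "apple", "plum", "pear", "apple"))

# Approach 1: group explicitly, then count the rows in each group.
by_group <- toy %>%
  filter(fruit %in% c("apple", "pear")) %>%
  group_by(fruit) %>%
  summarize(n = n())

# Approach 2: count() does the grouping and tallying in one step.
by_count <- toy %>%
  filter(fruit %in% c("apple", "pear")) %>%
  count(fruit)
```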

For each child, calculate the proportion of trials they got correct in each condition. Save it to a data frame called subject_means.

Use the subject_means data frame to calculate the mean proportion correct by condition. Plot the result as a bar plot. Include the following things:

a y-axis that ranges from 0 to 1 (use ylim()).
each condition as a different fill
an appropriate title
appropriate x- and y-axis labels
a red horizontal line indicating chance performance (use geom_hline(); the argument you need to set is yintercept).
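
The pieces listed above can be combined as in the following sketch, which uses made-up summary values rather than the assignment data. The data frame toy_means, its columns, and the reference level of 0.5 are all assumptions for illustration.

```r
library(ggplot2)

# Hypothetical precomputed means, one row per group.
toy_means <- data.frame(group = c("A", "B"), mean_prop = c(0.8, 0.6))

p <- ggplot(toy_means, aes(x = group, y = mean_prop, fill = group)) +
  geom_col() +                                   # bar heights come from precomputed values
  geom_hline(yintercept = 0.5, color = "red") +  # reference line at an assumed level
  ylim(0, 1) +
  labs(title = "Toy example", x = "Group", y = "Mean proportion")
```

geom_col() is used here because the means are already computed; geom_bar(stat = "identity") draws the same bars.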

Which condition are children better at?

Do children get better at the NF trials as they get older? Create a plot that shows mean performance at each age group (in years) on only NF trials.

Make a version of the previous plot that shows performance on the NF trials at each age group for each target object. Use facet_wrap(). You’ll need to create a new data frame like subject_means_with_years but one that also includes the variable target_object. Call the new data frame subject_means_with_years_obj.
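
As a reminder of what facet_wrap() does, here is a minimal sketch that splits one plot into a panel per level of a grouping variable. The data frame and its columns (age, item, prop) are hypothetical.

```r
library(ggplot2)

# Hypothetical summary data: one value per age per item.
toy <- data.frame(
  age = rep(c(2, 3, 4), times = 2),
  item = rep(c("hat", "shoe"), each = 3),
  prop = c(0.5, 0.6, 0.8, 0.4, 0.7, 0.9)
)

p <- ggplot(toy, aes(x = age, y = prop)) +
  geom_point() +
  geom_line() +
  facet_wrap(~item)  # one panel per distinct value of item
```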

Using me_data, make a new variable called scaled_vocabulary_score that ranges from 0 to 1, rather than 0 to 100.

Use me_data to plot the distribution of children’s scaled_vocabulary_score. To do this, you’ll need a data frame with only one row per child. Use geom_histogram().
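
Reducing trial-level data to one row per person can be sketched with distinct(), as below. The tibble and its columns (person, trial, score) are hypothetical, and the bin width is an arbitrary choice.

```r
library(dplyr)
library(ggplot2)

# Hypothetical trial-level data: score is constant within a person.
toy <- tibble(
  person = c("a", "a", "b", "b", "c"),
  trial = c(1, 2, 1, 2, 1),
  score = c(0.2, 0.2, 0.9, 0.9, 0.5)
)

# distinct() keeps one row per unique combination of the named columns.
one_row_each <- toy %>% distinct(person, score)

p <- ggplot(one_row_each, aes(x = score)) +
  geom_histogram(binwidth = 0.1)
```

Without the distinct() step, each person would be counted once per trial and the histogram would overstate the sample size.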

Do older children have higher vocabularies? Recreate the plot below:

Recreate the plot below, where each point corresponds to an individual child.

What other questions could we ask of this data?
[a] Pose an analytical question that could be answered with this data set.
[b] Make a clear, beautiful plot that helps answer this question. If appropriate, use multiple geoms in your plot. (Use a geom other than geom_bar, geom_violin, geom_boxplot, or geom_histogram.)
[c] Interpret your plot.

For inspiration, check out the R ggplot gallery.

Modern Research Methods