Objectives
The primary objective of this assignment is to give you more practice with:
filter
, arrange
, select
, summarize
, mutate
, group_by
)You should also:
slice()
distinct()
facet_wrap()
and facet_grid()
This assignment is due Thursday, September 30th at noon. Please turn your .html AND .Rmd files into Canvas. Your .Rmd file should knit without an error before turning in the assignment.
This assignment concerns a dataset from an experiment that tested whether 2-4 year-old children could learn new words from exclusion (Lewis, Cristiano, Lake, Kwan & Frank, 2020).
There were two conditions. In the critical condition, children saw two objects. One of the objects was an object that the child knew the label for (e.g., a ball) and the other object was an object that the child did not know the label for (e.g., tongs). The experimenter then asked the child to point to the novel object by saying, e.g., “Can you find the tongs?”. If the child assumes that each object only has one name, they should assume that this new label refers to the tongs, and not the ball. This phenomenon is called “Mutual Exclusivity” in the literature (Markman & Wachtel, 1988), because children are thought to assume that a new label is mutually exclusive with an old one. Let’s call this condition the “Novel-Familiar” condition, or NF.
In the control condition, children again saw two objects. This time both of the objects were objects that the child knew a label for (e.g., a ball and a cup). The experimenter then asked the child to point to one of the objects by saying, e.g., “Can you find the ball?”. Let’s call this condition the “Familiar-Familiar” condition, or FF.
Each child completed 7 trials: 4 in the NF condition and 3 in the FF condition. On each trial we recorded which object was the correct choice, and whether or not the child pointed to the correct object. We also measured two variables for each child: The age of the child and their performance on an separate vocabulary test.
Each variable in the dataset is described below:
Here is the path to a lightly cleaned version of the dataset:
<- "https://raw.githubusercontent.com/mllewis/cumulative-science/master/static/data/tidy_me_data.csv" DATA_PATH
me_data
. Use the glimpse()
function to determine: sub_id
, and target_object
.<- read_csv(DATA_PATH) me_data
## Rows: 1274 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): condition, target_object
## dbl (5): sub_id, trial_num, age_years, age_months, vocabulary_score
## lgl (1): correct
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(me_data)
## Rows: 1,274
## Columns: 8
## $ sub_id <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3,…
## $ trial_num <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4,…
## $ age_years <dbl> 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3,…
## $ age_months <dbl> 47.04, 47.04, 47.04, 47.04, 47.04, 47.04, 47.04, 30.3…
## $ vocabulary_score <dbl> 72.73, 72.73, 72.73, 72.73, 72.73, 72.73, 72.73, 59.0…
## $ condition <chr> "FF", "FF", "FF", "NF", "NF", "NF", "NF", "FF", "FF",…
## $ target_object <chr> "balloon", "bowl", "apple", "helm", "beaker", "gourd"…
## $ correct <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
There are 1274 observations. sub_id is numeric/a double. target_objrect is a character/string.
slice()
to print rows 1 and 3 from me_data
. arrange
and slice()
to print 7 rows of the first trial (where trial_num is 1).%>%
me_data arrange(trial_num) %>%
slice(1,3)
## # A tibble: 2 × 8
## sub_id trial_num age_years age_months vocabulary_score condition target_object
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 1 3 47.0 72.7 FF balloon
## 2 3 1 3 41.6 77.3 FF cookie
## # … with 1 more variable: correct <lgl>
%>%
me_data arrange(trial_num) %>%
slice(1:7)
## # A tibble: 7 × 8
## sub_id trial_num age_years age_months vocabulary_score condition target_object
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 1 3 47.0 72.7 FF balloon
## 2 2 1 2 30.4 59.1 FF cup
## 3 3 1 3 41.6 77.3 FF cookie
## 4 4 1 3 41.5 72.7 FF guitar
## 5 5 1 3 42.4 68.2 FF guitar
## 6 6 1 2 32.3 63.6 FF apple
## 7 7 1 3 38.0 68.2 FF apple
## # … with 1 more variable: correct <lgl>
%>%
me_data distinct(sub_id) %>%
nrow()
## [1] 185
%>%
me_data filter(age_months >= 42) %>%
distinct(sub_id) %>%
nrow()
## [1] 52
There are 185 children in the experiment. There are 52 children who are at least three and half years of age.
group_by
to answer this question. count
to answer this question.%>%
me_data filter(target_object %in% c("balloon", "apple", "guitar")) %>%
group_by(target_object) %>%
summarize(n = n())
## # A tibble: 3 × 2
## target_object n
## <chr> <int>
## 1 apple 49
## 2 balloon 44
## 3 guitar 45
%>%
me_data filter(target_object %in% c("balloon", "apple", "guitar")) %>%
count(target_object)
## # A tibble: 3 × 2
## target_object n
## <chr> <int>
## 1 apple 49
## 2 balloon 44
## 3 guitar 45
subject_means
.<- me_data %>%
subject_means group_by(sub_id, condition) %>%
summarize(prop_correct = sum(correct)/n())
## `summarise()` has grouped output by 'sub_id'. You can override using the `.groups` argument.
subject_means
data frame to calculate the mean proportion correct by condition. Plot the result as a bar plot. Include the following things:ylim
).geom_hline()
; geom_hline takes one parameter, yintercept).Which condition are children better at?
<- subject_means %>%
overall_means group_by(condition) %>%
summarize(mean_correct = mean(prop_correct))
ggplot(overall_means, aes(x = condition, y = mean_correct, fill = condition)) +
ggtitle("Mean correct selections by condition") +
geom_bar(stat = "identity") +
ylab("mean correct selections") +
ylim(0,1) +
geom_hline(yintercept = .5, color = "red")
<- me_data %>%
subject_means_with_years group_by(sub_id, age_years, condition) %>%
summarize(prop_correct = sum(correct)/n())
## `summarise()` has grouped output by 'sub_id', 'age_years'. You can override using the `.groups` argument.
<- subject_means_with_years %>%
overall_means_by_year filter(condition == "NF") %>%
group_by(age_years) %>%
summarize(mean_prop_correct = mean(prop_correct))
ggplot(overall_means_by_year, aes(x = age_years, y = mean_prop_correct)) +
geom_bar(stat = "identity") +
ylab("Mean proportion correct NF trials") +
ggtitle("Mean prop correct NF trials by age") +
xlab("age (years)") +
ylim(0,1) +
geom_hline(yintercept = .5)
facet_wrap()
. You’ll need to create a new data frame like subject_means_with_years
but one that also includes the variable target_object
. Call the new data frame subject_means_with_years_obj.
<- me_data %>%
subject_means_with_years_obj group_by(sub_id, age_years, condition, target_object) %>%
summarize(prop_correct = sum(correct)/n())
## `summarise()` has grouped output by 'sub_id', 'age_years', 'condition'. You can override using the `.groups` argument.
<- subject_means_with_years_obj %>%
overall_means_by_year_obj filter(condition == "NF") %>%
group_by(age_years, target_object) %>%
summarize(mean_prop_correct = mean(prop_correct))
## `summarise()` has grouped output by 'age_years'. You can override using the `.groups` argument.
ggplot(overall_means_by_year_obj, aes(x = age_years, y = mean_prop_correct)) +
geom_bar(stat = "identity") +
ylim(0,1) +
facet_wrap(~target_object) +
geom_hline(yintercept = .5)
me_data
, make a new variable called scaled_vocabulary_score
that ranges from 0 to 1, rather than 0 to 100.<- me_data %>%
me_data mutate(scaled_vocabulary_score = vocabulary_score/100)
me_data
to plot the distribution of children’s scaled_vocabulary_score
. To do this, you’ll need a data frame with only one row per child. Use geom_histogram()
.%>%
me_data distinct(sub_id, .keep_all = TRUE) %>%
ggplot(aes(x = scaled_vocabulary_score)) +
geom_histogram(binwidth = .1)
%>%
me_data distinct(sub_id, .keep_all = TRUE) %>%
ggplot(aes(x = age_months, y = vocabulary_score)) +
geom_point() +
geom_smooth(method = "lm") +
xlab("Age (months)") +
ylab("Vocabulary score") +
ggtitle("Vocabulary vs. Age")
## `geom_smooth()` using formula 'y ~ x'
<- me_data %>%
subject_means_with_years_months filter(condition == "FF") %>%
group_by(sub_id, age_months, vocabulary_score) %>%
summarize(prop_correct = sum(correct)/n())
## `summarise()` has grouped output by 'sub_id', 'age_months'. You can override using the `.groups` argument.
<- pivot_longer(subject_means_with_years_months,cols = 2:3)
data_for_faceting
ggplot(data_for_faceting, aes(x=value, y = prop_correct)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~name, scales = "free") +
labs(title = "Mean proportion correct on familiar-familiar trials",
subtitle = "Predicted by age (left) and vocabulary score (right)")
## `geom_smooth()` using formula 'y ~ x'
Another analytical question that we could ask from this data set would be if as children age if they tend to get better at the FF or NF at the same rates. I’ll answer this question using the dataset and the plot below.
<- me_data %>%
prop_correct_both_cond group_by(sub_id, age_months, condition) %>%
summarise(prop_correct_both = mean(correct), .groups = 'keep')
%>%
prop_correct_both_cond ggplot(mapping = aes(x = age_months, y = prop_correct_both, color = condition)) +
facet_wrap(~ condition) +
geom_smooth(method = lm) +
ggtitle("Mean proportion correct on trials in both NF and FF conditions at varying ages")
## `geom_smooth()` using formula 'y ~ x'
Based on the above plot with the smoothed linear line showing the trends in mena proportion correct at increasing ages (in months) for both the familiar-familiar condition and the novel-familiar condition, it appears that the NF condition starts at a lower base mean proportion correct in the youngest children, and increases more sharply over time than does the mean proportion correct in the FF condition.
We could ask if the condition children perform better in changes over time.
%>%
me_data group_by(age_years, condition) %>%
mutate(prop_correct = mean(correct)) %>%
ggplot(aes(x = age_years,
y = prop_correct,
color = condition)) +
geom_line() +
geom_point() +
labs(x = "Age",
y = "Proportion of Trials Correct",
title = "Change in Performance of NF vs. FF Condition over Time") +
theme_classic()
It appears that while children perform much better in FF conditions at younger ages as they develop, they get better at the NF condition to the point that success rate is about even at 4 years of age. It would be interesting to see how the performance rate continues to change as children age more.
Analytical question: Is there a linear relationship between the proportion of NF trials answered correctly and the proportion of FF trials answered correctly by the children?
<- me_data %>%
data_1 group_by(sub_id, condition) %>%
filter(correct == TRUE) %>%
summarise(num_correct = n()) %>%
ungroup()
## `summarise()` has grouped output by 'sub_id'. You can override using the `.groups` argument.
<- me_data %>%
data_2 group_by(sub_id, condition) %>%
summarise(tot = n())
## `summarise()` has grouped output by 'sub_id'. You can override using the `.groups` argument.
<- full_join(data_1, data_2) full_data
## Joining, by = c("sub_id", "condition")
<- full_data %>%
subject_means mutate(prop = num_correct/tot)
<- subject_means %>%
ff_data filter(condition == 'FF') %>%
select(-c('num_correct','tot'))
<- subject_means %>%
nf_data filter(condition == 'NF') %>%
select(-c('num_correct','tot'))
<- ff_data %>%
wide_ff_data pivot_wider(names_from = condition,
values_from = prop)
<- nf_data %>%
wide_nf_data pivot_wider(names_from = condition,
values_from = prop)
<- full_join(wide_ff_data, wide_nf_data) wide_full_data
## Joining, by = "sub_id"
<- na.omit(wide_full_data)
wide_full_data ggplot(wide_full_data, aes(x = FF, y = NF)) +
geom_point(position = 'jitter') +
geom_smooth(method = 'lm') +
labs(title = 'Relationship between proportion of trials answered correctly
in the NF and FF conditions',
x = 'Proportion of FF trials answered correctly',
y = 'Proportion of NF trials answered correctly')
## `geom_smooth()` using formula 'y ~ x'
Interpretation of plot: As seen in the above plot, there are several children (each child is represented by a point in the plot) that either have a low X-value (low proportion of FF trials answered correctly) but have a high Y-value (high proportion of NF trials answered correctly). Similarly, there are some children who have a high X-value and a low Y-value. Hence, there does not seem to be a clear linear relationship between the proportion of FF and NF trials answered correctly by the children.