Objectives
By the end of this assignment, you should:
- understand the concept of “cumulative science”
- be able to identify the type of a variable
- understand the properties of “tidy data”
- understand how to isolate data (select, filter, arrange)
- understand how to use the pipe operator (%>%)
This assignment is due Thursday, September 16th at noon. You should complete the assignment in the .Rmd template. Please turn in both your .html AND .Rmd files on Canvas. Your .Rmd file should knit without errors before you turn in the assignment. If you need help, there are a lot of resources available to you. Please reach out if you’re stuck.
To get started, you’ll need to download and open up the Rmarkdown template in RStudio. The first few exercises focus on data from the Lewis & Frank (2018) replication of the Xu and Tenenbaum (2007) experiment (that we talked about in lecture). We’ll be working with data from the first experiment only. For reference, the journal paper write up of this study can be found here, and you can see the actual experiment that participants saw here.
The data are in a file called lewis_2018_exp1.csv that lives on the internet. We can load the data into R by passing the online filepath to the read_csv() function. Once we read it into R, we can save it to a variable called lf_data:
lf_data <- read_csv("https://raw.githubusercontent.com/mllewis/cumulative-science/master/static/data/lewis_2018_exp1.csv")
There are six variables in the data and each variable is described below. The first six rows of the data frame are also displayed below.
- exp - Experiment number. Lewis & Frank (2018) had 12 experiments in it; the present dataset only includes the data from the first experiment.
- subids - Subject ID. This is an anonymous id that uniquely identifies every participant in the study.
- trial_num - Each participant completed 12 “trials.” In this case, a trial is a single screen where the participant sees a novel word, one or more examples, and then is asked to click on other examples of the novel word.
- category - There were three different categories of objects: vehicles, vegetables, and animals. Each participant saw some trials from each category.
- condition - This is the variable that we manipulated. It refers to the number of examples of the novel word participants saw at the top of the page. Participants saw either 3 subordinate examples (“three_subordinate”; e.g., 3 dalmatians), 3 basic level examples (“three_basic”; e.g., a dalmatian, a poodle, and a Bernese mountain dog), 3 superordinate examples (“three_superordinate”; e.g., a dalmatian, a rabbit, and a horse), or just a single example (“one”; e.g., 1 dalmatian).
- proportion_basic_level_responses - This is the variable that we measured. It refers to the proportion (out of 2 possible) of basic level examples that a participant selected.
1 | 1 | 9 | vehicles | three_subordinate | 0
1 | 2 | 9 | animals | three_basic | 1
1 | 3 | 9 | animals | three_superordinate | 1
1 | 4 | 9 | vehicles | three_superordinate | 1
1 | 5 | 9 | animals | three_superordinate | 1
1 | 6 | 9 | vegetables | three_subordinate | 0
- Is this dataset tidy? Describe the smallest unit of observation in this dataset.
- Select the columns subids, category, and proportion_basic_level_responses from the data. Print the first six rows of this data frame.
- Print the first six rows of a data frame excluding the category column.
- Use logical tests and Boolean operators to return only the trials (rows): [a] with category as vegetables, [b] with category as vehicles and a trial number greater than 3, [c] with category as vegetables or animals, [d] with at least one basic level response in the “one” condition.
- The following code selects all trials (rows) where the condition was either “three_subordinate” or “one.” Rewrite this code in a way that uses the %in% operator.
filter(lf_data, condition == "three_subordinate" | condition == "one")
- How many trials are there where the category is either vegetables or animals? Use nrow().
- The three following sets of commands are written without the pipe operator (%>%). Rewrite each one to include the pipe.
[a]
var1 <- mutate(lf_data, category)
[b]
var1 <- select(lf_data, category)
var2 <- nrow(var1)
[c]
var1 <- filter(lf_data, trial_num == 1)
var2 <- filter(var1, category == "animals")
var3 <- select(var2, trial_num, category)
- The two following sets of commands are written with the pipe operator. Rewrite each one to exclude the pipe.
[a]
lf_data %>%
filter(trial_num < 6) %>%
nrow()
[b]
lf_data %>%
select(subids, category, proportion_basic_level_responses) %>%
filter(subids == 1) %>%
arrange(category)
- Look at the code below. Describe in full sentences what this code does.
lf_data %>%
select(subids, category, condition) %>%
filter(category == "vehicles" & condition != "one") %>%
arrange(-subids)
- On the first day of class, we talked about the “Sally Anne Task” that measures children’s understanding of theory of mind (example videos). Describe four variables that you could measure in this task to assess children’s theory of mind performance. Specifically, describe (1) one qualitative variable, (2) one quantitative - binary variable, (3) one quantitative - numeric variable, and (4) one quantitative - real variable. For each variable, give a one-sentence description of the variable AND one example value of that variable with units.
- Describe the ways in which the scientific process could be described as a “social endeavor”. Your answer should make reference to the concepts of “replication” and “reproducibility”. Please respond with a short paragraph.
---
title: "Assignment 1: Cumulative Science and Intro to dplyr"
subtitle: "Modern Research Methods"
output:
  html_document:
    code_download: true
    css: ../lab.css
    highlight: kate
    theme: cosmo
    toc: false
    toc_float: false
---

```{r global_options, include = F}
library(tidyverse)
library(knitr)
```


<br>
<br>
<div id="boxedtext">

 <font size="4"> **Objectives** </font> 
 
By the end of this assignment, you should:

- understand the concept of "cumulative science"
- be able to identify the type of a variable
- understand the properties of "tidy data"
- understand how to isolate data (`select`, `filter`, `arrange`)
- understand how to use the pipe operator (`%>%`)
</div>

This assignment is due **Thursday, September 16th at noon**. You should complete the assignment in the .Rmd template. Please turn in both your .html AND .Rmd files on Canvas. Your .Rmd file should knit without errors before you turn in the assignment. If you need help, there are [a lot of resources](/resource/getting_help/) available to you. Please reach out if you're stuck.

<br> 

To get started, you'll need to download and open up the <a href="/assignment/01_MRM_assignment_template.Rmd" download>Rmarkdown template</a> in RStudio. The first few exercises  focus on data from the Lewis & Frank (2018) replication of the Xu and Tenenbaum (2007) experiment (that we talked about in lecture). We'll be working with data from the first experiment only.  For reference, the journal paper write up of this study can be found [here](http://www.andrew.cmu.edu/user/mollylew/papers/LF_2018.pdf), and you can see the actual experiment that participants saw [here](https://langcog.stanford.edu/expts/MLL/XTMEM/exp1/exp1.html).

The data are in a file called `lewis_2018_exp1.csv` that lives on the internet. We can load the data into R by passing the online filepath to the `read_csv()` function. Once we read it into R, we can save it to a variable called `lf_data`:
```{r, message = F}
lf_data <- read_csv("https://raw.githubusercontent.com/mllewis/cumulative-science/master/static/data/lewis_2018_exp1.csv")
```
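
`read_csv()` accepts a local file path as well as a URL. As a minimal sketch, the code below writes a tiny, made-up CSV to a temporary file and reads it back. The file contents and the names `toy_path` and `toy_csv` are hypothetical and are not part of the assignment data.

```r
library(readr)

# Write a small, made-up CSV to a temporary file (hypothetical data).
toy_path <- tempfile(fileext = ".csv")
writeLines(c("id,group,score",
             "1,a,0.5",
             "2,b,1.0"),
           toy_path)

# read_csv() takes a path (or URL) and returns a tibble, guessing each
# column's type from its contents. col_types = cols() keeps the default
# guessing but silences the column-specification message.
toy_csv <- read_csv(toy_path, col_types = cols())

toy_csv
```

Printing the result shows each column's guessed type under its name, which is a quick way to check that the file was read as intended.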

There are six variables in the data and each variable is described below. The first six rows of the data frame are also displayed below. 

* **exp** - Experiment number. Lewis & Frank (2018) had 12 experiments in it; the present dataset only includes the data from the first experiment.
* **subids** - Subject ID. This is an anonymous id that uniquely identifies every participant in the study.
* **trial_num** - Each participant completed 12 "trials." In this case, a trial is a single screen where the participant sees a novel word, one or more examples, and then is asked to click on other examples of the novel word.
* **category** - There were three different categories of objects: vehicles, vegetables, and animals. Each participant saw some trials from each category. 
* **condition** - This is the variable that we manipulated. It refers to the number of examples of the novel word participants saw at the top of the page. Participants saw either 3 subordinate examples ("three_subordinate"; e.g., 3 dalmatians), 3 basic level examples ("three_basic"; e.g., a dalmatian, a poodle, and a Bernese mountain dog), 3 superordinate examples ("three_superordinate"; e.g., a dalmatian, a rabbit, and a horse), or just a single example ("one"; e.g., 1 dalmatian).
* **proportion_basic_level_responses** - This is the variable that we measured. It refers to the proportion (out of 2 possible) of basic level examples that a participant selected. 

```{r, echo =F}
kable(head(lf_data))
```
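
The exercises below use the dplyr verbs `select()`, `filter()`, and `arrange()`, plus the pipe. As a rough sketch of how each one behaves, the code below applies them to a small, made-up tibble. The tibble `toy`, its columns, and its values are hypothetical and unrelated to `lf_data`.

```r
library(dplyr)

# A small, made-up tibble (hypothetical; not the assignment data).
toy <- tibble(
  id    = c(1, 2, 3, 4),
  group = c("a", "b", "a", "b"),
  score = c(0.5, 1.0, 0.0, 0.5)
)

# select() keeps (or, with a minus sign, drops) columns.
select(toy, id, score)
select(toy, -group)

# filter() keeps the rows for which a logical test is TRUE.
filter(toy, group == "a" & score > 0)

# arrange() reorders rows; desc() sorts in descending order.
arrange(toy, desc(score))

# The pipe passes the left-hand side as the first argument of the
# function on the right, so these two expressions do the same thing.
nested <- arrange(filter(toy, score > 0), id)
piped  <- toy %>%
  filter(score > 0) %>%
  arrange(id)

all.equal(nested, piped)
```

The nested call has to be read from the inside out, while the piped version reads top to bottom in the order the steps are applied.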

<br> 

(1) Is this dataset tidy? Describe the smallest unit of observation in this dataset.

<br> 

(1) Select the columns `subids`, `category`, `proportion_basic_level_responses` from the data. Print the first six rows of this data frame. 


<br> 

(1) Print the first six rows of a data frame excluding the `category` column.


<br> 

(1) Use logical tests and Boolean operators to return only the trials (rows): [a] with category as vegetables, [b] with category as vehicles and a trial number greater than 3, [c] with category as vegetables or animals, [d] with at least one basic level response in the "one" condition.

<br> 

(1) The following code selects all trials (rows) where the condition was either "three_subordinate" or "one." Rewrite this code in a way that uses the `%in%` operator. 

```{r, eval = F}
filter(lf_data, condition == "three_subordinate" | condition == "one")
```

<br> 

(1) How many trials are there where the category is either vegetables or animals? Use `nrow()`.

<br> 

(1) The three following sets of commands are written without the pipe operator (`%>%`). Rewrite each one to include the pipe. 

[a]
```{r}
var1 <- mutate(lf_data, category)
```

[b]
```{r}
var1 <- select(lf_data, category)
var2 <- nrow(var1)
```

[c]
```{r}
var1 <- filter(lf_data, trial_num == 1)
var2 <- filter(var1, category == "animals")
var3 <- select(var2, trial_num, category)
```

<br> 

(1) The two following sets of commands are written with the pipe operator. Rewrite each one to exclude the pipe. 

[a]
```{r, eval = F}
lf_data %>%
  filter(trial_num < 6) %>%
  nrow()
```

[b]
```{r, eval = F}
lf_data %>%
  select(subids, category, proportion_basic_level_responses) %>%
  filter(subids == 1) %>%
  arrange(category)
```

<br> 

(1) Look at the code below. Describe in full sentences what this code does.

```{r, eval = F}
lf_data %>%
  select(subids, category, condition) %>%
  filter(category == "vehicles" & condition != "one") %>%
  arrange(-subids)
```

<br> 

(1) On the first day of class, we talked about the "Sally Anne Task" that measures children's understanding of theory of mind ([example videos](https://www.youtube.com/watch?v=oazK2fkRU1A)). Describe four variables that you could measure in this task to assess children's theory of mind performance. Specifically, describe (1) one qualitative variable, (2) one quantitative - binary variable, (3) one quantitative - numeric variable, and (4) one quantitative - real variable. For each variable, give a one-sentence description of the variable AND one example value of that variable with units.

<br> 


(1) Describe the ways in which the scientific process could be described as a "social endeavor". Your answer should make reference to the concepts of "replication" and "reproducibility". Please respond with a short paragraph. 

