---
title: "Assignment 1: Cumulative Science and Intro to dplyr - SOLUTIONS"
subtitle: "Modern Research Methods"
output:
  html_document:
    code_download: true
    css: lab.css
    highlight: kate
    theme: cosmo
    toc: false
    toc_float: false
---

```{r global_options, include = F}
library(tidyverse)
library(knitr)
```


<br>
<br>
<div id="boxedtext">

 <font size="4"> **Objectives** </font> 
 
By the end of this assignment, you should:

- understand the concept of "cumulative science"
- be able to identify the type of a variable
- understand the properties of "tidy data"
- understand how to isolate data (`select`, `filter`, `arrange`)
- understand how to use the pipe operator (`%>%`)
</div>

This assignment is due **Thursday, September 16th at noon**. You should complete the assignment in the .Rmd template. Please turn in your .html AND .Rmd files on Canvas. Your .Rmd file should knit without an error before you turn in the assignment. If you need help, there are [a lot of resources](/resource/getting_help/) available to you. Please reach out if you're stuck.

<br> 

To get started, you'll need to download and open the <a href="/assignment/01_MRM_assignment_template.Rmd" download>R Markdown template</a> in RStudio. The first few exercises focus on data from the Lewis & Frank (2018) replication of the Xu and Tenenbaum (2007) experiment (that we talked about in lecture). We'll be working with data from the first experiment only. For reference, the journal write-up of this study can be found [here](http://www.andrew.cmu.edu/user/mollylew/papers/LF_2018.pdf), and you can see the actual experiment that participants saw [here](https://langcog.stanford.edu/expts/MLL/XTMEM/exp1/exp1.html).

The data are in a file called `lewis_2018_exp1.csv` that lives on the internet. We can load the data into R by passing the online filepath to the `read_csv()` function. Once we read it into R, we can save it to a variable called `lf_data`:
```{r, message = F}
lf_data <- read_csv("https://raw.githubusercontent.com/mllewis/cumulative-science/master/static/data/lewis_2018_exp1.csv")
```

There are six variables in the data and each variable is described below. The first six rows of the data frame are also displayed below. 

* **exp** - Experiment number. Lewis & Frank (2018) reported 12 experiments; the present dataset only includes the data from the first experiment.
* **subids** - Subject ID. This is an anonymous id that uniquely identifies every participant in the study.
* **trial_num** - Each participant completed 12 "trials." In this case, a trial is a single screen where the participant sees a novel word, one or more examples, and then is asked to click on other examples of the novel word.
* **category** - There were three different categories of objects: vehicles, vegetables, and animals. Each participant saw some trials from each category. 
* **condition** - This is the variable that we manipulated. It refers to the number and type of examples of the novel word participants saw at the top of the page. Participants saw either 3 subordinate examples ("three_subordinate"; e.g., 3 dalmatians), 3 basic level examples ("three_basic"; e.g., a dalmatian, a poodle, and a Bernese mountain dog), 3 superordinate examples ("three_superordinate"; e.g., a dalmatian, a rabbit, and a horse), or just a single example ("one"; e.g., 1 dalmatian).
* **proportion_basic_level_responses** - This is the variable that we measured. It refers to the proportion (out of 2 possible) of basic level examples that a participant selected. 

```{r, echo =F}
kable(head(lf_data))
```

<br> 

(1) Is this dataset tidy? Describe the smallest unit of observation in this dataset.

Yes. The smallest unit of observation is a single trial for a single participant, identified by subids/category/condition. Would also accept subids/trial_num.
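
One way to check a proposed unit of observation is to count rows per candidate key: if no combination appears more than once, the key identifies a single observation. The sketch below uses a small made-up tibble, `toy_data`, that borrows column names from `lf_data`; the values are invented for illustration, and the same pipeline could be run on `lf_data` itself.

```r
library(dplyr)

# Hypothetical data with column names borrowed from lf_data (values invented)
toy_data <- tibble(
  subids    = c(1, 1, 1, 2, 2),
  trial_num = c(1, 2, 3, 1, 2),
  category  = c("animals", "vehicles", "animals", "vegetables", "animals"),
  condition = c("one", "one", "three_basic", "one", "one")
)

# Count rows per candidate key; any n > 1 would mean the key is not unique
duplicated_keys <- toy_data %>%
  count(subids, category, condition) %>%
  filter(n > 1)

# Zero rows here means each combination occurs exactly once
nrow(duplicated_keys)
```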


<br> 

(1) Select the columns `subids`, `category`, `proportion_basic_level_responses` from the data. Print the first six rows of this data frame. 

```{r}
lf_data %>%
  select(subids, category, proportion_basic_level_responses) %>%
  head()
```


<br> 

(1) Print the first six rows of a data frame excluding the `category` column.

```{r}
lf_data %>%
  select(-category) %>%
  head()
```

OR 

```{r}
lf_data %>%
    select(exp, subids, trial_num, condition,
           proportion_basic_level_responses) %>%
  head() 
```
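
The two solutions above should produce the same data frame: dropping a column with `-` leaves the remaining columns in their original order, which matches naming them explicitly. A minimal sketch on a made-up tibble (`toy_data` and its values are invented for illustration):

```r
library(dplyr)

# Hypothetical data frame (values invented for illustration)
toy_data <- tibble(
  subids    = c(1, 2),
  category  = c("animals", "vehicles"),
  condition = c("one", "three_basic")
)

dropped <- toy_data %>% select(-category)          # drop one column
kept    <- toy_data %>% select(subids, condition)  # keep the others by name

# Compare the two results
identical(dropped, kept)
```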

<br> 

(1) Use logical tests and Boolean operators to return only the rows that contain trials (rows): a) with category as vegetables, b) with category as vehicles and a trial greater than 3, c) with category as vegetables or animals, d) with at least one basic level response in the "one" condition.


## a
```{r,  eval = F}
lf_data %>%
  filter(category == "vegetables") 

```

## b
```{r, eval = F}
lf_data %>%
  filter(category == "vehicles", trial_num > 3)
```

## c
```{r, eval = F}
lf_data %>%
  filter(category == "animals" | category == "vegetables") 
```

OR 

```{r, eval = F}
filter(lf_data, category %in% c("animals", "vegetables"))
```

## d
```{r, eval = F}
lf_data %>%
  filter(proportion_basic_level_responses > 0, condition == "one") 
```

OR 

```{r, eval = F}
lf_data %>%
  filter(proportion_basic_level_responses >= 0.5, condition == "one")
```
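
Both answers to (d) select the same rows because the proportion is computed out of 2 possible responses, so it can only be 0, 0.5, or 1. The sketch below illustrates this on a made-up tibble (`toy_data` and its values are invented; only the column names come from the dataset).

```r
library(dplyr)

# Hypothetical rows covering each possible proportion (values invented)
toy_data <- tibble(
  condition = c("one", "one", "one", "three_basic"),
  proportion_basic_level_responses = c(0, 0.5, 1, 1)
)

with_gt <- toy_data %>%
  filter(proportion_basic_level_responses > 0, condition == "one")

with_gte <- toy_data %>%
  filter(proportion_basic_level_responses >= 0.5, condition == "one")

# The two filters keep the same rows when the values are only 0, 0.5, or 1
identical(with_gt, with_gte)
nrow(with_gt)
```

Note that separating conditions with a comma inside `filter()` combines them with a logical AND, just like `&`.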

<br> 

(1) The following code selects all trials (rows) where the condition was either "three_subordinate" or "one." Rewrite this code in a way that uses the `%in%` operator. 

```{r, eval = F}
filter(lf_data, condition == "three_subordinate" | condition == "one")
```


```{r, eval = F}
filter(lf_data, condition %in%  c("three_subordinate", "one"))
```
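
To see why the two versions agree, it can help to look at the logical vectors they produce outside of `filter()`. This is a small base R sketch on a hand-typed character vector (the vector is illustrative, not taken from the data):

```r
# Illustrative vector of condition labels
condition <- c("one", "three_basic", "three_subordinate", "three_superordinate")

# One equality test per value, combined with OR
with_or <- condition == "three_subordinate" | condition == "one"

# A single membership test against a set of values
with_in <- condition %in% c("three_subordinate", "one")

with_or
with_in
identical(with_or, with_in)
```

The `%in%` form tends to scale better as the set of accepted values grows, since each new value is one more element in the vector rather than one more `|` clause.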

<br> 

(1) How many trials are there where the category is either vegetables or animals? Use `nrow()`.

```{r, eval = F}
lf_data %>%
  filter(category == "vegetables"| category == "animals") %>%
  nrow()
```

OR

```{r, eval = F}
lf_data %>%
  filter(category %in% c("vegetables", "animals")) %>%
  nrow()
```

<br> 


(1) The three following sets of commands are written without the pipe operator (`%>%`). Rewrite each one to include the pipe. 

[a]
```{r}
var1 <- mutate(lf_data, category)
```

[b]
```{r}
var1 <- select(lf_data, category)
var2 <- nrow(var1)
```

[c]
```{r}
var1 <- filter(lf_data, trial_num == 1)
var2 <- filter(var1, category == "animals")
var3 <- select(var2, trial_num, category)
```


## a
```{r}
var1 <- lf_data %>%
  mutate(category)
```

## b
```{r}
var2 <- lf_data %>%
  select(category) %>%
  nrow()
```


## c
```{r}
var3 <- lf_data %>%
  filter(trial_num == 1,
         category == "animals") %>%
  select(trial_num, category)
```
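
The pipe passes the object on its left as the first argument of the function on its right, so `x %>% f(y)` is another way of writing `f(x, y)`. The sketch below compares the step-by-step and piped versions of (c) on a made-up tibble (`toy_data` and its values are invented for illustration).

```r
library(dplyr)

# Hypothetical data frame (values invented for illustration)
toy_data <- tibble(
  trial_num = c(1, 1, 2),
  category  = c("animals", "vehicles", "animals")
)

# Step-by-step version with intermediate variables
step1   <- filter(toy_data, trial_num == 1)
step2   <- filter(step1, category == "animals")
unpiped <- select(step2, trial_num, category)

# Piped version: the two filters are merged into one call
piped <- toy_data %>%
  filter(trial_num == 1, category == "animals") %>%
  select(trial_num, category)

# Compare the two results
identical(unpiped, piped)
```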



<br> 

(1) The two following sets of commands are written with the pipe operator. Rewrite each one to exclude the pipe. 

[a]
```{r, eval = F}
lf_data %>%
  filter(trial_num < 6) %>%
  nrow()
```

[b]
```{r, eval = F}
lf_data %>%
  select(subids, category, proportion_basic_level_responses) %>%
  filter(subids == 1) %>%
  arrange(category)
```


## a
```{r, eval = F}
var1 <- filter(lf_data, trial_num < 6) 
var2 <- nrow(var1)
```

OR

```{r, eval = F}
nrow(filter(lf_data, trial_num < 6))
```


## b
```{r, eval = F}
var1 <- select(lf_data, subids, category, proportion_basic_level_responses) 
var2 <- filter(var1, subids == 1)
var3 <- arrange(var2, category)
```

OR

```{r, eval = F}
arrange(
  filter(
    select(lf_data, subids, category, proportion_basic_level_responses),
    subids == 1
  ),
  category
)
```
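
Nested calls are evaluated from the innermost function outward, which is the reverse of the order in which the piped steps are written. A minimal sketch on a made-up tibble (`toy_data` and its values are invented for illustration) compares the two styles:

```r
library(dplyr)

# Hypothetical data frame (values invented for illustration)
toy_data <- tibble(
  subids    = c(1, 1, 2),
  category  = c("vehicles", "animals", "animals"),
  trial_num = c(1, 2, 1)
)

# Nested: select() runs first, then filter(), then arrange()
nested <- arrange(filter(select(toy_data, subids, category), subids == 1), category)

# Piped: the same steps, written in the order they run
piped <- toy_data %>%
  select(subids, category) %>%
  filter(subids == 1) %>%
  arrange(category)

# Compare the two results and inspect the sorted column
identical(nested, piped)
piped$category
```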


<br> 

(1) Look at the code below. Describe in full sentences what this code does.

```{r, eval = F}
lf_data %>%
  select(subids, category, condition) %>%
  filter(category == "vehicles" & condition != "one") %>%
  arrange(-subids)
```
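
One possible description: the code keeps only the `subids`, `category`, and `condition` columns, then keeps only the rows where the category is vehicles and the condition is anything other than "one", and finally sorts the remaining rows by subject ID from highest to lowest. The sketch below runs the same pipeline on a made-up tibble (`toy_data` and its values are invented for illustration) so that each step can be checked by eye.

```r
library(dplyr)

# Hypothetical data frame (values invented for illustration)
toy_data <- tibble(
  subids    = c(1, 2, 3, 4),
  trial_num = c(1, 1, 1, 1),
  category  = c("vehicles", "vehicles", "animals", "vehicles"),
  condition = c("three_basic", "one", "three_basic", "three_subordinate")
)

result <- toy_data %>%
  select(subids, category, condition) %>%                  # keep three columns
  filter(category == "vehicles" & condition != "one") %>%  # vehicles, not "one"
  arrange(-subids)                                         # highest subids first

names(result)
result$subids
```

For a numeric column, `arrange(-subids)` gives the same ordering as `arrange(desc(subids))`.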

<br> 

(1) On the first day of class, we talked about the "Sally Anne Task" that measures children's understanding of theory of mind ([example videos](https://www.youtube.com/watch?v=oazK2fkRU1A)). Describe four variables that you could measure in this task to assess children's theory of mind performance. Specifically, describe (1) one qualitative variable, (2) one quantitative - binary variable, (3) one quantitative - numeric variable, and (4) one quantitative - real variable. For each variable, give a one-sentence description of the variable AND one example value of that variable with units.

**Nora**: Qualitative: We could measure how engaged the child was with the task: whether the child seemed to be paying attention, was distracted, seemed confused, or was fussy.

**Victoria**: Qualitative: location of child’s gaze. This variable would be an observation of where the child is looking after the researcher demonstrates the task. Examples would include: basket, box, or the table (in between the basket and the box).

**Jonathan**: Qualitative variable: A variable that measures the child’s facial expression during the experiment. Possible values could be “interested”, “bored”, “puzzled”, etc.

**Raina**: Qualitative variable: This could be a type of variable that records the type of attention exhibited/displayed by each child who participated in the Sally Anne Task. Example value: A child could have an attention type of ‘attentive’, ‘fidgety’, ‘sleepy’, ‘bored’, etc.

<br> 


(1) Describe the ways in which the scientific process could be described as a "social endeavor". Your answer should make reference to the concepts of "replication" and "reproducibility". Please respond with a short paragraph. 

**Erik**: The scientific process could certainly be described as a “social endeavor”. When someone designs, executes, and publishes an experiment and their findings, other researchers interested in the same field will often look at these publications. Based on the findings, they may want to replicate the experiment in order to reproduce the findings. This is why, when going through the scientific process, it is important to consider replication, including describing the experiment fully and each step in detail. This will increase the reproducibility for other researchers, who may have good ideas to further the findings of the original experiment. This collaboration helps further the findings to adapt previous theories. So, if the original experiment had some theory as a result of the findings, replication of the experiment by others can prove or adapt the theory. If the experiment has reproducible findings, then a theory can be agreed upon. This shows how the scientific process is social, in that while one person may make a theory, it takes many to prove a theory right (reproducibility), or to adapt a theory after replicating an experiment.

**Victoria**: The scientific process can be considered a “social endeavor” because scientists and researchers must work together, collaborate, and share data to ensure that experiments are both replicable and reproducible. Replicability refers to repeating a study with the same hypothesis, experimental design, population, and analysis and achieving the same results. Replication involves communicating with other scientists whose experiments you are trying to replicate to ensure that there are no discrepancies and differing variables between your studies. Reproducibility refers to repeated procedures with replicated experiments that achieve the same results; the experimental plan and necessary code should be published online to ensure that other researchers who want to replicate the experiment have the data they need. Achieving reproducibility also involves communicating with scientists who have done previous research in the field to either confirm and build upon proposed theories if you have reproducible results or formulate new theories to reconcile differing results if you or others have different findings. Overall, the scientific process involves collaboration and sharing of data between researchers to reconcile findings and build upon theories.

**Jaemin**: The scientific process is used to see how the world works by acquiring knowledge in an empirical way by asking a specific question, conducting research in the area, providing a hypothesis, testing the hypothesis, gathering the data, and then reporting the conclusions. The process as a social endeavor is to solve these kinds of problems in order to complete certain tasks, satisfy needs, and make an advancement in society. However, we must make sure that when we do conduct the process that the results should be replicated, as we can conclude that it is more likely to be valid rather than by chance. The results must also be reproducible, so that we can make sure that we are obtaining consistent results when we use the same kind of input and conduct the same steps. Aiming for these kinds of consistent results and being able to replicate these kinds of results is how the scientific process should work so it will be a social endeavor for our society going forward.

**Bethany**: Part of the scientific process relates to the ability to reproduce results. This is important to ensure that the results, and subsequent conclusions, are due to manipulations in the study rather than confounding factors (e.g., unknown third variables, procedural decisions, sample characteristics). This means that the results from studies must be reproducible by others who re-conduct the experiment, known as replication studies. Therefore, the scientific process could be considered as a social endeavor because the ability to replicate and reproduce results depends on dissemination of research and communication between researchers to uncover what the true relationships are between variables.

**Emily**: The scientific process could be described as a social endeavor because a proper scientific study should involve the experiment being able to be replicated by other scientists, and it in fact should be replicated or reproduced by other scientists to ensure that the same results can be found when the experiment is repeated by other people. This allows for any biases by the initial experimenters to be eliminated (e.g. if the initial group of experimenters was looking for a specific outcome and allowed their bias to influence the way they performed the experiment), as well as any errors that they may have made in the process. Replication involves repeating a study with the same exact population, hypothesis, experimental design, and analytic plan, and would ideally result in obtaining the same results as the study you’re modeling off of, while reproducing a study involves repeating the same procedure, but not necessarily with the same population, etc., and would also ideally result in obtaining the same results as the initial study. Because replication and reproducibility of studies are vital to showing that the results they obtain are both valid and reliable, the scientific process must be a social endeavor, since generating a standing theory would therefore involve many different groups of scientists testing the same theory individually, and this means that a true theory can’t really be proved by just one person.