Reproducible Research with R

Reni Kaul & Deven Gokhale & Zane Billings
2022-06-06

About Myself

  • Epi/Bios PhD student with Andreas Handel.
  • I use data analysis and statistical models to study epidemiology and immunology of flu and norovirus.
  • I started using R and R Markdown in 2017.
  • I use R in my research to:
    • organize data
    • visualize results
    • fit models
    • write reproducible manuscripts
    • pretty much everything I do is in R (Markdown), actually.
  • You can get my contact info at https://wzbillings.com. My office is on the HSC so feel free to take the bus, or just email me. I'm happy to answer any R questions. (Or questions about grad school, or anything else that might help!)

Outline

At the end of this workshop you should be able to…

  • code using best practices
  • create an HTML doc with code and commentary
  • write a function
  • create and use scripts
  • calculate summary and inferential statistics of a dataset
  • create a figure from data

Topics

  1. Introduction to reproducible research
  2. Review of tidyverse

    Break

  3. Explore data using an R project

  4. Automating data analysis with functions

  5. Using scripts in your workflow

  6. Wrap Up

THE VERY FIRST THING WE ARE GOING TO DO.

  • Open R on your computer.
  • Run install.packages("usethis") to install the usethis package.
  • Run the command usethis::use_blank_slate(). If successful, it should print a message.
  • I recommend you do this first thing every time you set up R on a new computer and every time you update R.
  • We'll talk about why I had you do this in a bit :)

1. What is Reproducible Research (RR)?

Johns Hopkins Bloomberg School of Public Health:

Reproducible Research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them.

What is reproducible research?

Three R's: Replication, repeatability, and reproducibility

Replication concerns the number of data points (observations, study subjects, etc.) and establishes the generality of the observed finding to a study population.

Repeatability concerns the ability to arrive at the same findings when a study is repeated and establishes the generality of the observed finding to other study populations or systems.

Reproducibility concerns the reliability of the logic that leads from data to conclusions – that is, the data analysis. It would seem that reproducibility is an essential ingredient of scientific knowledge. But as data workflows become more and more complicated, they also bring more subjective decision-making by the data analyst, more computations, and more opportunities for error.

Why should research be made reproducible?

Four public virtues and one private one

  • Enables readers to develop a complete and exact understanding of methods
  • Enables error detection and correction
  • Facilitates critique
  • Enables extension and advancement
  • Saves time (private)

How can research be made reproducible?

Three concepts

  • Workflow/Dataflow is made reproducible by file organization and coding practices
  • Literate statistical programming is incorporated into research by blending computer code with natural language explanations
  • A scientific archive must be used for the public virtues to be realized

Workflow is...

the idea that data analysis may be viewed as a repeatable pattern of computational activities. If these activities are truly repeatable and truly computational, then they may be encoded in an algorithm (programming). This principle of repeatable computation is the key to reproducible research. In reproducible research, a computer program is written to perform an analysis.

Organization is key

What should be included in RR?

What would you need to recreate an experimental project?

What would you need to recreate a modeling project?

What should be included in RR?

What would you need to recreate an experimental project?

  • Experimental details: experimental design, sampling protocol, etc.
  • Analysis: R code
  • What else?

What would you need to recreate a modeling project?

  • Experimental details: model assumptions, number of simulations, etc.
  • Analysis: R code
  • What else?

What should be included in RR?

What would you need to recreate an experimental project?

  • Experimental details: experimental design, sampling protocol, etc.
  • Analysis: R code
  • What else?

Where is the 'Why'?

What would you need to recreate a modeling project?

  • Experimental details: model assumptions, number of simulations, etc.
  • Analysis: R code
  • What else?

Where is the 'Why'?

Within your code, comments and longer commentary should explain the why, rather than just the what or how.

Is this an example of good commenting? Does it explain the why?

passengers %>% # start with Titanic passengers data set
  filter(Sex == "male") %>% # pull out male passengers
  ggplot(aes(x = Age, y = Fare)) + # plot age on x axis and fare on y axis
  geom_point() # plot the data as points

Literate programming

  • The best code is “self-documenting.” It should be so easy to read that you don't need to comment what it does.
  • However, you still need to explain WHY you are doing it – computer scientists call this paradigm “literate programming.”

An example of better comments with commentary:

I hypothesize that older men will pay a higher fare because they have had more time to accumulate wealth. The first step in testing this hypothesis is to determine if there is a positive correlation between age and fare for male passengers.

# create a plot to see how male passengers' age correlated with the fare they paid. 
passengers %>% 
  filter(Sex == "male") %>% 
  ggplot(aes(x = Age, y = Fare)) +
  geom_point() 

2. Review of tidyverse

  • The tidyverse is a set of packages that work in harmony because they share a common design philosophy.
  • Their shared syntax uses pipes (%>%) to connect data (objects) to verbs (functions).
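To make the pipe concrete, here is a minimal sketch (using only base R functions, so nothing beyond the tidyverse itself is assumed):

```r
library(tidyverse)

# The pipe passes the object on its left as the first argument of the
# function on its right, so a nested call can be read top to bottom.
ages <- c(22, 35, 41)

round(mean(ages), 1)   # nested form: 32.7

ages %>%
  mean() %>%
  round(1)             # piped form; same result: 32.7
```

Both forms do the same computation; the piped version simply reads in the order the steps happen.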

tidy syntax

tidyverse package

The tidyverse package is designed to make it easy to install and load the core tidyverse packages in a single command. Last workshop we transformed and visualized data; this time we will also incorporate R Markdown.

tidyverse package

Double click “W2_Exercise.Rproj” to practice commenting and literate programming

Open the W2_ExerciseG_.Rmd file with your group number in the file name.

Questions to reflect on...

  1. What did you include in your commentary? in comments? why?
  2. What was challenging about documenting the other group's work?
  3. How does your document compare to your other group members' documents? Did you include similar information?
  4. What changes (if any) did you make to the markdown code to format the html output?

1. Introduction to reproducible research
2. Review of tidyverse

Break

  3. Explore data using an R project
  4. Automating data analysis with functions
  5. Using scripts in your workflow
  6. Wrap Up

3. Creating an R project

I think all of you should read this blog post: https://www.tidyverse.org/blog/2017/12/workflow-vs-script/

3. Creating an R project

Now, we are going to explore data starting from scratch. Let's start a new project.

Some great reasons to use projects at http://bit.ly/WhyUseProjects

Exploring the Global Mammal Parasite Database

It is now your turn to start an R project to explore the Global Mammal Parasite Database (GMPD).

  1. Start a new project from a new directory called GMPD
  2. Move the contents of W2_Exercise\GMPD_data into the appropriate location

Organization is key

What is the Global Mammal Parasite Database?

Take the next few minutes to review:

  • Check out the dataset publication at http://bit.ly/RworkshopGMPD under Supporting Information
  • Read the abstract of the metadata.pdf file
  • Read about the file formats on pages 14-19
  • Open the data files and match the column titles to the metadata explanation

You've been tasked to explore the data set

Your research group is particularly interested in:

  1. Which of the 3 groups (carnivores, primates, or ungulates) is the sickest? (hint: use the main.csv file)
  2. Are parasites transmitted by vectors more likely to have complex life cycles than non-vectored parasites? (hint: use the traits.csv file)

Be sure to use comments and commentary to explain your process.

Exploring the Global Mammal Parasite Database

You might start by loading the needed packages and data.

library(tidyverse)
gmp <- read_csv("data/GMPD_main.csv")
  • use the cheat sheets, W1_exercise.Rmd and each other for help!

Q1. Which group is sicker?

I'm using mean prevalence as a measurement of health. The mean prevalence is calculated for all reported parasites, ignoring parasite type. This might be an assumption to explore later. The sample sizes are very different so I will also calculate standard error to use when comparing the mean values.

gmp %>%
  drop_na(Prevalence) %>%
  group_by(Group) %>%
  summarise(MeanPrev = mean(Prevalence), SdErrPrev = sd(Prevalence)/sqrt(n())) %>%
  ggplot(aes(x = Group, y = MeanPrev)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = MeanPrev-SdErrPrev, ymax = MeanPrev+SdErrPrev), width = 0.2)

What is the output of this code chunk?

Q1. Important side note

Notice how this code is quite wide! This can make your R Markdown document look bad and can actually make the code overflow off the side of the page :)

I recommend that, in RStudio, you go to Tools -> Global Options -> Code section -> Display tab -> enable “Show Margin” and set the margin column to 80 characters. This is a good default that will prevent most size 12 monospaced fonts from overflowing.

gmp %>%
  drop_na(Prevalence) %>%
  group_by(Group) %>%
  summarise(
    MeanPrev = mean(Prevalence),
    SdErrPrev = sd(Prevalence)/sqrt(n())
  ) %>%
  ggplot(aes(x = Group, y = MeanPrev)) +
  geom_bar(stat = "identity") +
  geom_errorbar(
    aes(
      ymin = MeanPrev-SdErrPrev,
      ymax = MeanPrev+SdErrPrev
    ),
    width = 0.2
  )

Q1. Answer

gmp %>%
  drop_na(Prevalence) %>%
  group_by(Group) %>%
  summarise(
    MeanPrev = mean(Prevalence),
    SdErrPrev = sd(Prevalence)/sqrt(n())
  ) %>%
  ggplot(aes(x = Group, y = MeanPrev)) +
  geom_bar(stat = "identity") +
  geom_errorbar(
    aes(
      ymin = MeanPrev-SdErrPrev,
      ymax = MeanPrev+SdErrPrev
    ),
    width = 0.2
  )

(Output: bar chart of mean prevalence by group, with standard-error bars)

Q2. Do vectored parasites also have intermediate hosts?

I am interested in the proportion of parasites with complex life histories, i.e., intermediate hosts. I think that vectored parasites are more likely than non-vectored parasites to have an intermediate host. For a first pass, I want to visualize the number of vectored parasites.

gmpTraits <- read_csv("data/GMPD_parasite_traits.csv")

gmpTraits %>%
  mutate(vector = ifelse(vector == 0, "No", "Yes")) %>%
  mutate(intermediate = ifelse(intermediate == 0, "No", "Yes")) %>%
  ggplot(aes(x = vector, fill = intermediate)) +
  geom_bar()

What happens if the ifelse() statements aren't used?

Q2. Do vectored parasites also have intermediate hosts?

gmpTraits %>%
  mutate(vector = ifelse(vector == 0, "No", "Yes")) %>%
  mutate(intermediate = ifelse(intermediate == 0, "No", "Yes")) %>%
  ggplot(aes(x = vector, fill = intermediate)) +
  geom_bar()

(Output: bar chart of parasite counts by vector status, filled by intermediate-host status)

4. Automating data analysis with functions

  • Sometimes a set of analyses needs to be done multiple times. Instead of a copy-and-paste approach, we can write a function.
  • Functions can help you make fewer errors and also make your code easier to understand; you will often hear this called the DRY (Don't Repeat Yourself) principle.


Anatomy of a function

function_name <- function(argument1, argument2, ...){
  # some analysis
  return(output_of_function)
}

Function Names

Function names should…

  • be verbs describing what the function will do.
  • use snake_case_for_naming
  • Previous versions of these slides recommended camelCaseForNaming, but the tidyverse style guide (recommended reading!) recommends snake_case because it is more accessible to screen readers, ESL speakers, and people with dyslexia. You never know who is going to look at your code next, so it should be as accessible as possible!

GOOD: calculate_avg_clicks

BAD: CalculateAvgClicks, calculateAvgClicks, cALCULATEaVGcLICKS

  • When you are writing functions, give your code room to breathe! You don't get a prize for making your code as short as possible, so make your code easier to read.

An Example

Move W2_Functions.Rmd from the W2_Exercise folder into your GMPD folder containing the GMPD.Rproj

Open the .Rmd once it is in the correct location.

An Example

scenario: In addition to the GMPD, there are also parasite datasets for cetaceans and fish among others. They all follow the same format and column naming system. You've been asked to continue exploring which groups are sickest (Q1) using these additional datasets.

task: Write a function that will automate these calculations for each dataset.

An Example

scenario: In addition to the GMPD, there are also parasite datasets for cetaceans and fish among others. They all follow the same format and column naming system. You've been asked to continue exploring which groups are sickest (Q1) using these additional datasets.

task: Write a function that will automate these calculations for each dataset.

step 1: Determine what the function should do. Maybe try out the analysis on a sample dataset

Step 1. Sketching out the function analysis

  • We already have code that does this for the gmp dataset:
gmp %>%
  drop_na(Prevalence) %>%
  group_by(Group) %>%
  summarise(MeanPrev = mean(Prevalence), SdErrPrev = sd(Prevalence)/sqrt(n()))
# A tibble: 3 x 3
  Group      MeanPrev SdErrPrev
  <chr>         <dbl>     <dbl>
1 carnivores    0.388   0.0141 
2 primates      0.328   0.00534
3 ungulates     0.500   0.00706

Step 2. Decide how to generalize the code

  • What names will change with each dataset?
  • What will stay the same?
gmp %>%
  drop_na(Prevalence) %>%
  group_by(Group) %>%
  summarise(MeanPrev = mean(Prevalence), SdErrPrev = sd(Prevalence)/sqrt(n()))

Step 3. Put into function format

prev_by_par_type <- function(dataset) {
  dataset %>%
    drop_na(Prevalence) %>%
    group_by(Group) %>%
    summarise(MeanPrev = mean(Prevalence), SdErrPrev = sd(Prevalence)/sqrt(n()))
}

Step 4. Try it out

prev_by_par_type <- function(dataset) {
  dataset %>%
    drop_na(Prevalence) %>%
    group_by(Group) %>%
    summarise(MeanPrev = mean(Prevalence), SdErrPrev = sd(Prevalence)/sqrt(n()))
}

prev_by_par_type(gmp)
# A tibble: 3 x 3
  Group      MeanPrev SdErrPrev
  <chr>         <dbl>     <dbl>
1 carnivores    0.388   0.0141 
2 primates      0.328   0.00534
3 ungulates     0.500   0.00706

Tidyverse Challenge

Modify this function so that the mean prevalence is calculated for each host group and parasite type.

prev_by_par_type <- function(dataset) {
  dataset %>%
   # calculate mean prevalence for host group and parasite type
 }

Tidyverse Challenge: Solution

Modify this function so that the mean prevalence is calculated for each host group and parasite type.

prev_by_par_type <- function(dataset) {
  dataset %>%
    drop_na(Prevalence) %>%
    # add second grouping variable
    group_by(
      Group,
      ParType
    ) %>%
    summarise(
      MeanPrev = mean(Prevalence),
      SdErrPrev = sd(Prevalence)/sqrt(n())
    )
}

Important Caveat

  • When you want to use tidyverse functions inside your own function, you cannot pass a bare column name as an ordinary function argument.
  • If you try the following function and pass Prevalence to the outcome parameter, you will get an error!
prev_by_par_type <- function(dataset, outcome) {
  dataset %>%
    drop_na(outcome) %>%
    # add second grouping variable
    group_by(
      Group,
      ParType
    ) %>%
    summarise(
      MeanPrev = mean(outcome),
      SdErrPrev = sd(outcome)/sqrt(n())
    )
}
  • The tidyverse uses “non-standard evaluation” (as opposed to base R's “standard evaluation”). Passing column names to your own functions is possible, but it requires some more advanced programming topics; you can read about this online if you are interested.
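For the curious, here is one way around this caveat, sketched as an aside rather than workshop material: the tidyverse “embracing” operator {{ }} (from the rlang package) forwards a bare column name into tidyverse verbs. The function name prev_by_outcome is made up for this example.

```r
library(tidyverse)

# Wrapping the argument in {{ }} tells tidyverse verbs to look up
# `outcome` as a column of `dataset`, not as an ordinary variable.
prev_by_outcome <- function(dataset, outcome) {
  dataset %>%
    drop_na({{ outcome }}) %>%
    group_by(Group) %>%
    summarise(
      MeanOut = mean({{ outcome }}),
      SdErrOut = sd({{ outcome }}) / sqrt(n())
    )
}

# Now a bare column name works as an argument:
# prev_by_outcome(gmp, Prevalence)
```

With {{ }}, the version that failed above runs without error.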

Returning to RR

Dataflow is the idea that when a variable changes, downstream computations affected by that variable should change as well.

So in this case, we've written a function that can take in different datasets or updated versions and adjust the output automatically*

*as long as the column names are consistent.
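As a sketch of that dataflow idea (the extra dataset and file names here are hypothetical), the same function is simply re-run when the inputs change:

```r
library(tidyverse)

# The function from the previous slides:
prev_by_par_type <- function(dataset) {
  dataset %>%
    drop_na(Prevalence) %>%
    group_by(Group) %>%
    summarise(
      MeanPrev = mean(Prevalence),
      SdErrPrev = sd(Prevalence) / sqrt(n())
    )
}

# Hypothetical additional datasets with the same column names:
# cetaceans <- read_csv("data/GMPD_cetaceans.csv")
# prev_by_par_type(cetaceans)   # downstream results update automatically
```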

5. Using scripts in your workflow

We've touched on organization earlier. Let's focus on the scripts folder.


Keeping a clean workflow

If you are writing functions that:

  • are long
  • are used in different parts of your research
  • won't change often

then a script can make for a cleaner workflow.


What is a script?

  • contains stable code, often consisting of functions
  • is saved as a .R file
  • is loaded into the R environment using the source() command
  • simplifies your .Rmd files

Make a script

  1. Create a .R file
  2. Make some comments at the start of the file, write your functions
  3. Save file in scripts directory with an informative name

Make a script

  1. Create a .R file

Make a script

  1. Create a .R file
  2. Make some comments at the start of the file
# Written on May 11, 2018 by Reni Kaul
# This script contains functions for the R workshop; REU popbio program. 
# Originally written to be part of the W2_Functions.Rmd exploration

prev_by_par_type <- function(dataset) {
  dataset %>%
    drop_na(Prevalence) %>%
    group_by(Group) %>%
    summarise(
      MeanPrev = mean(Prevalence),
      SdErrPrev = sd(Prevalence)/sqrt(n())
    )
}

Script recommendations:

  • Never start a script with rm(list = ls())
  • Never start a script with setwd(my_folder_that_no_one_else_has)
  • Don't install packages in a script. Use library() or another way of declaring packages, and the users will know what they need to install.

Make a script

  1. Create a .R file
  2. Make some comments at the start of the file
  3. Save file in scripts directory with an informative name
  • What is a good name for this script?
  • Create a scripts folder inside the GMPD project and save your script there

Use a script

Only for this demo: Session > Restart R

Source the script

source("scripts/YOURFILE.R")
  • What's in the Environment tab?
  • Note: you should treat everything in the environment tab as ephemeral. The only thing that actually exists is your scripts and markdown files. If running your code doesn't re-create an object, that object doesn't actually exist.

Other RR recommendations

  • We have barely scratched the surface of reproducible research!
  • Especially if you are new to programming, this is a GREAT place to start, and if you follow these tips, your code will be well above average.
  • If you are an experienced programmer, you may be interested in looking into Git/GitHub, renv, targets, and inline R coding for Markdown files. Using these packages, you can automate and future-proof an entire project so that it “just works” when someone else downloads it.
  • We covered a lot today and you should be proud of yourself! Please contact me with any questions!!!

6. Wrap Up

We can…

  • code using best practices
  • create an HTML doc with code and commentary
  • write a function
  • create and use scripts
  • calculate summary and inferential statistics of a dataset
  • create a figure from data


Reminder: complete project worksheet

6. Wrap Up

Next

  • Modeling
    • simulating data
    • fitting models to data
