Reni Kaul & Deven Gokhale & Zane Billings
2022-06-06
At the end of this workshop you should be able to…
Topics
Review of tidyverse
Break
Explore data using a R project
Automating data analysis with functions
Using scripts in your workflow
Wrap Up
R
on your computer.install.packages("usethis")
to install the usethis
package.usethis::use_blank_slate()
. If successful, it should print a message.Johns Hopkins Bloomberg School of Public Health:
Reproducible Research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them.
Three R's: Replication, repeatability, and reproducibility
Replication concerns the number of data points (observations, study subjects, etc.) and establishes the generality of the observed finding to a study population.
Repeatability concerns the ability to arrive at the same findings when a study is repeated and establishes the generality of the observed finding to other study populations or systems.
Reproducibility concerns the reliability of the logic that leads from data to conclusions – that is, the data analysis. It would seem that reproducibility is an essential ingredient of scientific knowledge. But, as data workflow become more and more complicated they also bring more subjective decision-making by the data analyst, more computations, and more opportunities for error.
Four public virtues and one private one
Three concepts
the idea that data analysis may be viewed as a repeatable pattern of computational activities. If these activities are truly repeatable and truly computational, then they may be encoded in an algorithm (programming). This principle of repeatable computation is the key to reproducible research. In reproducible research, a computer program is written to perform an analysis.
Organization is key
What would you need to recreate an experimental project?
What would you need to recreate a modeling project?
What would you need to recreate an experimental project?
R
codeWhat would you need to recreate a modeling project?
R
codeWhat would you need to recreate an experimental project?
R
codeWhere is the why ?
What would you need to recreate a modeling project?
R
codeWithin your code your comments and longer commentary should include the why instead of the what or how.
Is this an example of good commenting the why?
passengers %>% # start with Titanic passengers data set
filter(Sex == "male") %>% # pull out male passengers
ggplot(aes(x= Age, y =Fare)) + # plot age on x axis and fare on y axis
geom_point() # plot the data as points
An example of better comments with commentary:
I hypothesize that older men will pay a higher fare because they have had more time to accumulate more wealth. The first step of testing this hypothesis is to determine if there is a possitive correlation between age and fare for male passengers.
# create a plot to see how male passengers' age correlated with the fare they paid.
passengers %>%
filter(Sex == "male") %>%
ggplot(aes(x= Age, y =Fare)) +
geom_point()
%>%
) to connect data (object) to verbs (functions). The tidyverse package is designed to make it easy to install and load core packages from the tidyverse
in a single command. Last workshop we transformed and visualized data, this time we will also incorporate R Markdown.
Open the W2_ExerciseG_.Rmd
with your group number in the file name/
1. Introduction to reproducible research
2. Review of tidyverse
I think all of you should read this blog post: https://www.tidyverse.org/blog/2017/12/workflow-vs-script/
Now, we are going explore data starting from scratch. Let's start a new project.
Some great reasons to use projects at http://bit.ly/WhyUseProjects
It is now your turn to start a R project to explore the Global Mammal Parasite Database (GMPD).
GMPD
Move the contents of W2_Exercise\GMPD_data
into the appropriate location
Organization is key
Take the next few minutes to review:
metadata.pdf
file Your research group is particularly interested in:
Be sure to use comments and commentary to explain your process.
You might start by loading the needed packages and data.
library(tidyverse)
gmp <- read_csv("data/GMPD_main.csv")
W1_exercise.Rmd
and each other for help! I'm using mean prevalence as a measurement of health. The mean prevalence is calculated for all reported parasites, ignoring parasite type. This might be an assumption to explore later. The sample sizes are very different so I will also calculate standard error to use when comparing the mean values.
gmp %>%
drop_na(Prevalence) %>%
group_by(Group) %>%
summarise(MeanPrev = mean(Prevalence), SdErrPrev = sd(Prevalence)/sqrt(n())) %>%
ggplot(aes(x = Group, y = MeanPrev)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = MeanPrev-SdErrPrev, ymax = MeanPrev+SdErrPrev), width = 0.2)
what is the output of this code chunk?
Notice how this code is quite wide! This can mark your R Markdown document look bad and can actually make the code overflow off the side of the page :)
I recommend that, in R Studio, you go to Tools -> Global Options -> Code section -> Display tab -> enable “Show Margin” and set the number to 80 characters. This is a good default that will prevent most size 12 monospaced fonts from overflowing.
gmp %>%
drop_na(Prevalence) %>%
group_by(Group) %>%
summarise(
MeanPrev = mean(Prevalence),
SdErrPrev = sd(Prevalence)/sqrt(n())
) %>%
ggplot(aes(x = Group, y = MeanPrev)) +
geom_bar(stat = "identity") +
geom_errorbar(
aes(
ymin = MeanPrev-SdErrPrev,
ymax = MeanPrev+SdErrPrev
),
width = 0.2
)
gmp %>%
drop_na(Prevalence) %>%
group_by(Group) %>%
summarise(
MeanPrev = mean(Prevalence),
SdErrPrev = sd(Prevalence)/sqrt(n())
) %>%
ggplot(aes(x = Group, y = MeanPrev)) +
geom_bar(stat = "identity") +
geom_errorbar(
aes(
ymin = MeanPrev-SdErrPrev,
ymax = MeanPrev+SdErrPrev
),
width = 0.2
)
I am interested in the proportion of parasites with complex life histories or intermediate hosts. I think that vectored parasites as compared to non-vectored parasites are more likely to have an intermediate host. For the first pass, I want to visualize the number of vectored parasites.
gmpTraits <- read_csv("data/GMPD_parasite_traits.csv")
gmpTraits %>%
mutate(vector = ifelse(vector == 0, "No", "Yes")) %>%
mutate(intermediate = ifelse(intermediate == 0, "No", "Yes")) %>%
ggplot(aes(x = vector, fill = intermediate)) +
geom_bar()
what happens if an ifelse
statement isn't used?
gmpTraits %>%
mutate(vector = ifelse(vector == 0, "No", "Yes")) %>%
mutate(intermediate = ifelse(intermediate == 0, "No", "Yes")) %>%
ggplot(aes(x = vector, fill = intermediate)) +
geom_bar()
function_name <- function(argument1, argument2, ...){
#some analysis
return(output_of_function)
}
Function names should…
snake_case_for_naming
camelCaseForNaming
, but the tidyverse style guide (recommended reading!) recommends snake_case because it is more accessible to screen readers, ESL speakers, and people with dyslexia. You never know who is going to look at your code next, so it should be as accessible as possible!GOOD: calculate_avg_clicks
BAD: CalculateAvgClicks
, calculateAvgClicks
, cALCULATEaVGcLICKS
Move W2_Functions.Rmd
from the W2_Exercise folder into your GMPD folder containing the GMPD.Rproj
Open the .Rmd
once it is in the correct location.
scenario: In addition to the GMPD, there are also parasite datasets for cetaceans and fish among others. They all follow the same format and column naming system. You've been asked to continue exploring which groups are sickest (Q1) using these additional datasets.
task: Write a function that will automate these calculations for each dataset.
scenario: In addition to the GMPD, there are also parasite datasets for cetaceans and fish among others. They all follow the same format and column naming system. You've been asked to continue exploring which groups are sickest (Q1) using these additional datasets.
task: Write a function that will automate these calculations for each dataset.
step 1: Determine what the function should do. Maybe try out the analysis on a sample dataset
gmp
datasetgmp %>%
drop_na(Prevalence) %>%
group_by(Group) %>%
summarise(MeanPrev = mean(Prevalence), SdErrPrev = sd(Prevalence)/sqrt(n()))
# A tibble: 3 x 3
Group MeanPrev SdErrPrev
<chr> <dbl> <dbl>
1 carnivores 0.388 0.0141
2 primates 0.328 0.00534
3 ungulates 0.500 0.00706
gmp %>%
drop_na(Prevalence) %>%
group_by(Group) %>%
summarise(MeanPrev = mean(Prevalence), SdErrPrev = sd(Prevalence)/sqrt(n()))
prev_by_par_type <- function(dataset){
dataset %>%
drop_na(Prevalence) %>%
group_by(Group) %>%
summarise(MeanPrev = mean(Prevalence), SdErrPrev = sd(Prevalence)/sqrt(n()))
}
prev_by_par_type <- function(dataset) {
dataset %>%
drop_na(Prevalence) %>%
group_by(Group) %>%
summarise(MeanPrev = mean(Prevalence), SdErrPrev = sd(Prevalence)/sqrt(n()))
}
prev_by_par_type(gmp)
# A tibble: 3 x 3
Group MeanPrev SdErrPrev
<chr> <dbl> <dbl>
1 carnivores 0.388 0.0141
2 primates 0.328 0.00534
3 ungulates 0.500 0.00706
Modify this function so that the mean prevalence is calculated for each host group and parasite type.
prev_by_par_type <- function(dataset) {
dataset %>%
# calculate mean prevalence for host group and parasite type
}
Modify this function so that the mean prevalence is calculated for each parasite type.
prev_by_par_type <- function(dataset) {
dataset %>%
drop_na(Prevalence) %>%
# add second grouping variable
group_by(
Group,
ParType
) %>%
summarise(
MeanPrev = mean(Prevalence),
SdErrPrev = sd(Prevalence)/sqrt(n())
)
}
prevalence
to the outcome
parameter, you will get an error! prev_by_par_type <- function(dataset, outcome) {
dataset %>%
drop_na(outcome) %>%
# add second grouping variable
group_by(
Group,
ParType
) %>%
summarise(
MeanPrev = mean(outcome),
SdErrPrev = sd(outcome)/sqrt(n())
)
}
Dataflow is the idea that when a variable changes, downstream computations affected by that variable should change as well.
So in this case, we've written a function that can take in different datasets or updated versions and adjust the output automatically*
*as long as the column names are consistent.
We've touched on organization earlier. Let's focus on the scripts folder.
If you are writing functions that:
Then, a script might make for clean workflow.
.R
file typesource()
command.Rmd
files .R
file .R
file .R
file # Written on May 11, 2018 by Reni Kaul
# This script contains functions for the R workshop; REU popbio program.
# Originally written to be part of the W2_Functions.Rmd exploration
prev_by_par_type <- function(dataset){
dataset %>%
drop_na(Prevalence) %>%
group_by(Group) %>%
Script recommendations:
rm(list = ls())
setwd(my_folder_that_no_one_else_has)
library()
or another way of declaring packages, and the users will know what they need to install..R
file GMPD
project and save your scriptonly for this demo Session > Restart R
Source the script
source("scripts/YOURFILE.R")
renv
, targets
, and inline R coding for Markdown files. Using these packages, you can automate and future-proof an entire project so that it “just works” when someone else downloads it.We can…
Reminder: complete project worksheet
Next