This place will be our new discussion panel regarding all issues on the R User’s Group. Please invite everyone who is interested.
Cheers,
Ina and Rene
Reading and Writing Data in R
Making publication quality plots in R
And many more…
The first ~30 min will be covered by a speaker introducing a new topic.
After that, 30 min are scheduled to work on the new topic (the speaker will provide an R script).
Finally, we aim for 60 min of hacking on an applied problem. The problem can be submitted by a user in advance of the meeting or, if no problem is announced, we can simply keep working on the
topic presented that day.
Looking forward to seeing as many of you as possible in 2017.
Cheers,
Ina and Rene
Workshop helpers for the Software Carpentry Bootcamp needed!
Hello R Users,
There is a great chance for everyone who wants to become more familiar with R and to meet people who might become valuable contacts for your R career. Emily Brennan from the Office of the Deputy Vice-Chancellor (Research) is looking for helpers for the Software Carpentry Bootcamp here at MQ on Thursday 18 & Friday 19 of May.
Please get in contact with Emily if you are interested and would like to have more information.
emily.brennan@mq.edu.au
Hope to see you at the Bootcamp,
Rene
May 2017
This month’s R Users Group meeting provides an overview of the options available in Hadley Wickham’s ggplot2 and in R’s inbuilt alternative, base graphics. I (Rene Heim) will guide you through an article by Nathan Yau, posted on flowingdata.com in March 2016, to equip you with some tools to make your own decision: ggplot2, base graphics, or both. Let’s meet on Tuesday the 30th of May 2017 at 3 pm in the Biology Tearoom as usual. The script I am going to use will be provided on our Quantitative Advice forum (R Users Group Macquarie Uni) for you to follow along.
See you then…
Rene
####
# Introduction
#
# This script provides an overview of the functions used in Hadley Wickham's
# "ggplot2" package and in R's inbuilt base graphics.
#
# Many R Users seemingly want to make a decision about which of the above-mentioned options is better.
# Personally, I try to use functions, packages, and guidelines that get me where I want to go. Let's
# see if the following examples help you to find your own way.
#
# **My examples and opinions are based on the following resources:**
#
# * R Graphics Cookbook by O'Reilly (http://www.cookbook-r.com/Graphs/)
# * Comparing ggplot2 and R Base Graphics by Nathan Yau (https://flowingdata.com/2016/03/22/comparing-ggplot2-and-r-base-graphics/)
# * ggplot2: Elegant Graphics for Data Analysis by Wickham (http://www.springer.com/us/book/9780387981413)
# * Why I don't use ggplot2 by Jeff Leek (https://simplystatistics.org/2016/02/11/why-i-dont-use-ggplot2/)
# * Why I use ggplot2 by David Robinson (http://varianceexplained.org/r/why-I-use-ggplot2/)
#
# **Further, I would like to recommend the following resources to learn either of the shown plotting options:**
#
# * R base graphics cheat sheet (http://publish.illinois.edu/johnrgallagher/files/2015/10/BaseGraphicsCheatsheet.pdf)
# * ggplot2 cheat sheet (https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf)
# * ggplot2 guideline (http://ggplot2.tidyverse.org/reference/)
# * R Base Graphics: An Idiot's Guide (http://rstudio-pubs-static.s3.amazonaws.com/7953_4e3efd5b9415444ca065b1167862c349.html)
####
####
# 1. The Bar Chart
####
# Load Libraries
if (!requireNamespace("reshape2", quietly = TRUE)) install.packages("reshape2")
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
library(reshape2) #contains a dataset we will use
library(ggplot2)
# Create Data
data <- data.frame(
  time = factor(c("Lunch","Dinner"), levels=c("Lunch","Dinner")),
  total_bill = c(14.89, 17.23)
)
# ggplot2 example
ggplot(data=data, aes(x=time, y=total_bill)) +
  geom_bar(colour="black", fill="#DD8888", width=.8, stat="identity") +
  xlab("Time of day") + ylab("Total bill") +
  ggtitle("Average bill for 2 people") +
  theme_bw() # nice for changing the background and overall appearance (see also the ggthemes package)
?theme_bw
# Check out tab auto-completion: type "theme" and browse the available options.
# R base graphics example
par(las=1) #Positioning of the axis labels (either 0,1,2 or 3)
barplot(data$total_bill,
        names.arg=data$time, # can take a vector of labels, e.g. names.arg=c("A","B","C","D","E","F","G")
        col="#AFC0CB",
        border=TRUE,
        main="Average Bill for Two-Person Meal",
        axisnames=TRUE) # must be TRUE for names.arg to work
# You could do this as well xlab="LETTERS", ylab="MY Y VALUES"
####
# If you’re unfamiliar with ggplot2, it implements a “grammar of graphics” based on Leland Wilkinson’s
# book The Grammar of Graphics. In ggplot you can split a chart into graphical objects — data, scale,
# coordinate system, and annotation — and think about them separately. When you put it all together
# (using a + operator), you get a complete chart.
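# To make the "split into objects, then combine with +" idea concrete, here is a
# minimal sketch using the same toy data as above, building the chart one piece
# at a time instead of as a single chained expression (assumes ggplot2 is installed):

```r
library(ggplot2)
df <- data.frame(time = c("Lunch", "Dinner"), total_bill = c(14.89, 17.23))
p <- ggplot(df, aes(x = time, y = total_bill)) # data + aesthetic mapping
p <- p + geom_bar(stat = "identity")           # add a geometry layer
p <- p + ggtitle("Assembled with +")           # add an annotation
# print(p) draws the assembled chart
```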
#
# On the other hand, with base graphics you use the barplot() function and
# specify everything in the function arguments: a single function call
# contains all the settings.
#
# Is this important for us??
####
# In ggplot2, you specify a count by day through aes() and geom_bar().
data2 <- reshape2::tips # the tips dataset shipped with reshape2
ggplot(data=data2, aes(x=day)) +
  geom_bar(stat="count")
# Check out what happens:
summary(subset(data2, day=="Sat"))
# In base graphics, you work with the data outside of the visualization functions. In this case,
# you can use table() to aggregate by day, and you pass that result to barplot().
tips_per_day <- table(data2$day) # table() creates a simple table of counts of the elements in a vector
barplot(tips_per_day) # ggplot() and plot() are generic functions, whereas qplot() isn't
?table
####
# 2. Line Chart
####
ggplot(data=data, aes(x=time, y=total_bill, group=1)) +
  geom_line(colour="red", linetype="dashed", size=1.5) +
  geom_point(colour="red", size=4, shape=21, fill="white")
# Why use group? It depends on whether our variables are numeric or factors.
# R can sometimes interpret numeric variables as having a logical order. See:
# https://stackoverflow.com/questions/10357768/plotting-lines-and-the-group-aesthetic-in-ggplot2
# for more information.
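A small base-R illustration of the underlying point, using the same toy levels: a factor is stored as integer codes, and on a factor x-axis each level forms its own discrete group, which is why geom_line() needs group=1 here.

```r
time <- factor(c("Lunch", "Dinner"), levels = c("Lunch", "Dinner"))
as.integer(time) # factor levels are stored as integer codes: 1 2
levels(time)     # "Lunch" "Dinner"
# Each level is its own discrete group, so geom_line() needs group = 1
# to treat all rows as one series and connect the points.
```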
plot(c(1,2), data$total_bill, type="l", xlab="time", ylab="",
     lty=2, lwd=3, col="red")
points(c(1,2), data$total_bill, pch=21, col="red", cex=2,
       bg="white", lwd=3)
# Note the naming difference: base R uses terse aesthetic names (col, pch,
# cex, etc.), while ggplot2 uses more descriptive ones (colour, shape, size).
?par #Set or Query Graphical Parameters
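One thing worth knowing about par(): its settings persist until changed, so a common base-R pattern (a sketch, base R only) is to save the previous values when setting new ones and restore them after plotting.

```r
op <- par(las = 1, mfrow = c(1, 2)) # returns the old values of the changed settings
plot(1:10, main = "left panel")
plot(10:1, main = "right panel")
par(op)                             # restore the previous settings
```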
####
# 3. Multiple Variables
####
# Create data
data3 <- data.frame(
  sex = factor(c("Female","Female","Male","Male")),
  time = factor(c("Lunch","Dinner","Lunch","Dinner"), levels=c("Lunch","Dinner")),
  total_bill = c(13.53, 16.81, 16.24, 17.42)
)
ggplot(data=data3, aes(x=time, y=total_bill, fill=sex)) +
  geom_bar(stat="identity", position=position_dodge(),
           colour="black") +
  scale_fill_manual(values=c("#999999", "#E69F00")) # check out a colour picker!
# Base Graphics
dat1mat <- matrix(data3$total_bill,
                  nrow = 2,
                  byrow = TRUE,
                  dimnames = list(c("Female", "Male"), c("Lunch", "Dinner"))
)
mf_col <- c("#3CC3BD", "#FD8210")
barplot(dat1mat, beside = TRUE, border=NA, col=mf_col)
legend("topleft", row.names(dat1mat), pch=15, col=mf_col) # in ggplot2 the legend is added automatically
####
# 4. Facets
####
sp <- ggplot(data2, aes(x=total_bill, y=tip/total_bill)) + geom_point(shape=1)
sp
sp + facet_grid(. ~ sex)
# Easy with only two plots in ggplot2. In base graphics we need a loop (or we simply draw two plots side by side).
par(mfrow=c(1,2))
sexes <- unique(data2$sex)
for (i in 1:length(sexes)) {
  currdata <- data2[data2$sex == sexes[i],]
  plot(currdata$total_bill, currdata$tip/currdata$total_bill,
       main=sexes[i], ylim=c(0,0.7))
}
# But with more than two?
sp <- ggplot(data2, aes(x=total_bill, y=tip/total_bill)) + geom_point(shape=1)
sp + facet_grid(sex ~ day)
# And base graphics
par(mfrow=c(2,4))
days <- c("Thur", "Fri", "Sat", "Sun")
sexes <- unique(data2$sex)
for (i in 1:length(sexes)) {
  for (j in 1:length(days)) {
    currdata <- data2[data2$day == days[j] & data2$sex == sexes[i],]
    plot(currdata$total_bill, currdata$tip/currdata$total_bill,
         main=paste(days[j], sexes[i], sep=", "), ylim=c(0,0.7), las=1)
  }
}
####
#END
#
#For further questions please contact me: rene.heim@hdr.mq.edu.au
####
June 2017
In the upcoming MQ R users meeting, we will welcome Dr. Ben Marwick from the University of Washington, Seattle. He is an archaeologist, and an expert in reproducibility in research broadly. For the next meeting, Ben will show how to enhance the reproducibility of your research using R and related tools such as Git, Travis, and Docker. He will illustrate with examples from his published articles. We start from how to make a basic research compendium and look at some more advanced tooling. As always, everybody is welcome to join us on Tuesday, the 27th of June 2017 in the Biology Tearoom at 3 pm. Ben will talk for roughly 30-60 min and there will be plenty of time for questions and answers.
For everyone who is new to R and would like to get started or just needs a refresher, we are offering our second R introduction on the same day, 27th of June, 1:30 pm in the Biology Tearoom.
(We will use Ina’s talk about the visualization of ordination analysis to compensate for unexpected speaker cancellations.)
See you then
Ina, James, Kaja and Rene
July 2017
The versatility of R
In the upcoming MQ R users meeting Arvind Hughes, an HDR student at MQ, will talk about the versatility of R. He will show us applications in astrophysics, horse racing, and the property market. As always, everybody is welcome to join us on Tuesday, the 25th of July 2017 in the Biology Tearoom at 3 pm. Arv will talk for roughly 30-60 min and there will be plenty of time for questions, answers and problem-solving. Bring your R problems!!! Everyone is welcome…
See you then
Ina, James, Kaja and Rene
P.S. Are you a student or academic staff? Do you want to get involved in organizing the R Users Group events in 2018? Let us know!
• kaja.wierucka@hdr.mq.edu.au
• ina.geedicke@hdr.mq.edu.au
• james.lawson@mq.edu.au
• rene.heim@hdr.mq.edu.au
And here is the code about “Matching”, provided by Thierry! Have fun!
# Estimation of the Average Treatment Effect (tau) of a binary exposure variable (Z) on a Gaussian outcome (Y) using matching.
## In this script, we simulate observational data and evaluate the performance of different matching techniques.
## Observational data means that some subjects are more likely to be exposed to the treatment than others (i.e. the exposure mechanism is not random, unlike in randomised experiments).
## To estimate the ATE, we want to compare the outcome between the two treatment groups (control Z=0 and treated Z=1). However, we need to make sure that we are comparing similar subjects to avoid confounding.
## Matching enables us to extract a sample of subjects with similar covariate distributions. This eliminates confounding and gives an unbiased estimate of the ATE.
## Matching can be done directly on observed covariates or on the propensity score, depending on knowledge of the true confounding factors and the dimension of the dataset.
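As a toy illustration of why the naive difference-in-means is biased under confounding (my own hypothetical numbers, not part of Thierry's simulation): a confounder that drives both exposure and outcome pushes the naive estimate away from the true effect.

```r
set.seed(1)
n <- 10000
x <- rnorm(n)                      # confounder
z <- rbinom(n, 1, plogis(x))       # exposure is more likely when x is high
y <- 10 + 3 * x + 2 * z + rnorm(n) # true treatment effect tau = 2
naive <- mean(y[z == 1]) - mean(y[z == 0])
naive # clearly above 2, because x pushes up both z and y
```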
rm(list=ls())
library(Matching)
setwd('/Users/mq20160354/Google Drive/R') # adjust this path to your own working directory
source('df_generate.R')
## Loop over data samples (assess impact of sampling noise)
Nsamp = 20
tau_relBias = matrix(0, Nsamp, 5)
for (k in 1:Nsamp) {
## ------------------------------------------------------------------------------------------------------
## Simulate data
## X: n x p matrix of pre-treatment covariates
## Z: Binary exposure indicator
## Y: Continuous outcome variable
## ------------------------------------------------------------------------------------------------------
n = 500
p = 9
seed = k #= sample(1:50,1)
ls = df_generate(n, p, seed)
df = ls[['df']]
ps_true = ls[['ps']]
tau_true = ls[['tau']]
rm(ls)
## ------------------------------------------------------------------------------------------------------
## Explore covariate balance and outcome difference between treatment groups
## ------------------------------------------------------------------------------------------------------
## Outcome difference-in-means (Naive estimation of tau)
tau_hat = mean(df$Y[df$Z==1]) - mean(df$Y[df$Z==0])
tau_relBias_naive = abs(tau_hat - tau_true)/tau_true * 100
## Covariate difference-in-means
lapply(1:p, function(j) {
t.test(df[,j] ~ df$Z)
})
par(mfrow=c(3,3))
for (j in 1:p) {
hist(df[df$Z==0,j], main = paste(colnames(df)[j]), xlab = '')
hist(df[df$Z==1,j], add = T, col = 'grey')
}
## ------------------------------------------------------------------------------------------------------
## Estimate propensity score (probability of being exposed conditional on the baseline covariates)
## ------------------------------------------------------------------------------------------------------
## Logistic regression of Z on X
fml = as.formula(paste('Z~', paste(colnames(df[,1:p]), sep = '', collapse = '+')))
glm_fit = glm(fml, df, family = 'binomial')
summary(glm_fit)
ps_hat = glm_fit$fitted.values
# plot(ps_hat, ps_true)
# abline(0, 1, col='red')
## Examine common support
table(df$Z)
par(mfrow=c(1,1))
hist(ps_hat[df$Z==0], main = 'Probability of being exposed', xlab = '')
hist(ps_hat[df$Z==1], add = T, col = 'grey')
legend('topright', c('Control','Exposed'), fill=c('white','grey'))
## ------------------------------------------------------------------------------------------------------
## Matching individuals based on observed confounders, or propensity score
## Propensity score: probability of being exposed to treatment conditional on baseline characteristics
## ------------------------------------------------------------------------------------------------------
## CASE 1: confounders are known (X1 and X2)
match_out = Match(Y=df$Y, Tr=df$Z, X=df[,1:2], estimand = 'ATE')
tau_hat = as.numeric(match_out$est)
tau_CI_m1 = round(c(tau_hat-1.96*match_out$se, tau_hat+1.96*match_out$se), 2)
tau_relBias_m1 = abs(tau_hat - tau_true)/tau_true * 100
## CASE 2: confounders are not known --> matching on all covariates
match_out = Match(Y=df$Y, Tr=df$Z, X=df[,1:p], estimand = 'ATE')
tau_hat = as.numeric(match_out$est)
tau_CI_m2 = round(c(tau_hat-1.96*match_out$se, tau_hat+1.96*match_out$se), 2)
tau_relBias_m2 = abs(tau_hat - tau_true)/tau_true * 100
## CASE 3: confounders are not known --> matching on propensity score (perfectly specified)
match_out = Match(Y=df$Y, Tr=df$Z, X=ps_true, estimand = 'ATE') #, caliper = 0.2)
tau_hat = as.numeric(match_out$est)
tau_CI_m3 = round(c(tau_hat-1.96*match_out$se, tau_hat+1.96*match_out$se), 2)
tau_relBias_m3 = abs(tau_hat - tau_true)/tau_true * 100
## CASE 4: confounders are not known --> matching on propensity score (imperfectly specified)
match_out = Match(Y=df$Y, Tr=df$Z, X=ps_hat, estimand = 'ATE') #, caliper = 0.2)
tau_hat = as.numeric(match_out$est)
tau_CI_m4 = round(c(tau_hat-1.96*match_out$se, tau_hat+1.96*match_out$se), 2)
tau_relBias_m4 = abs(tau_hat - tau_true)/tau_true * 100
## Print results
tau_relBias_k = round(c(tau_relBias_naive, tau_relBias_m1, tau_relBias_m2,
tau_relBias_m3, tau_relBias_m4), 3)
out = data.frame(Method = c('Naive', 'Matching on known confounders', 'Matching on all variables',
'Matching on PS', 'Matching on estimated PS'),
Relative_Bias = tau_relBias_k)
out = out[order(out$Relative_Bias),]
knitr::kable(out)
tau_relBias[k,] = tau_relBias_k
}
## Visualise results
par(mfrow=c(1,1))
boxplot(tau_relBias[,1], tau_relBias[,2], tau_relBias[,3], tau_relBias[,4], tau_relBias[,5],
names = c('Naive', 'X1-2', 'Xall', 'PStrue', 'PShat'), ylab = 'Absolute Relative Bias in tau (%)',
border = c('black','black','black','grey60','black'), main = 'Performance in estimating the ATE')
apply(tau_relBias, 2, mean)
Here is the other part of Thierry’s code: the script he is sourcing. I hope it works; let me know if you run into trouble.
df_generate = function(n, p, seed) {
  library(arm) # for the invlogit() function
  set.seed(seed) # seed before any random draws so X, Z, and Y are all reproducible
  ## Generate covariate matrix (X)
  ## n = number of subjects
  ## p = number of baseline covariates
  X = matrix(rnorm(n*p, mean=0, sd=1), n, p)
  ## Simulate a binary treatment indicator (Z)
  ## ps (propensity score): probability of being exposed conditional on X
  ps = invlogit(-2 + 1.5*X[,1] + 1.75*X[,2] + 1.25*X[,3]) # logistic regression model
  # hist(ps)
  Z = rbinom(n, 1, ps) # sample a Bernoulli variable based on the conditional probabilities
  # table(Z)
  ## Simulate a Gaussian outcome variable (Y)
  tau = 2
  Y = 10 + 1.75*X[,1] + 1.5*X[,2] + 1.25*X[,4] + tau*Z # linear regression model with treatment effect
  # hist(Y)
  ## Bind data
  df = data.frame(X, Z, Y)
  return(list(df=df, ps=ps, tau=tau))
}
Hi R-Users!
Our next meeting is happening next Tuesday (26/09/2017) in the Biology Tearoom at 3 pm, as usual. Dr. Maina Mbui, Lecturer in Spatial Information Science at the Department of Environmental Sciences, here at Macquarie Uni, will talk about Model Selection in R. James and Kaja will also provide our third R-Introduction this year. If you need a little R refresher or want to make the first contact with R, please show up at the Biology Tearoom at 1.30pm on the same day. Bring your laptop and have R and R Studio installed!
For those who are new to our little community, don’t be afraid to just pop by. We are R-Users with very different backgrounds and skill levels, and we meet informally each month to talk about problems and solutions in R. If you don’t have time, just make some time, or sign up at http://quantitative-advice.gg.mq.edu.au/
If you need any support with R, the Macquarie R-Users Group can help.
See you soon,
Kaja, Ina, James, and René
P.S. We are still looking for support in 2018! MRes and 1st year PhD students preferred. Write us: rene.heim@hdr.mq.edu.au