I am an experimental social psychologist at Stanford University. I study scalable behavior change, especially in the context of education. Towards this end, I founded PERTS, the Project for Education Research That Scales. I work with Professors Greg Walton, Carol Dweck, Geoff Cohen, Zak Tormala, and David Yeager.
I love to do all of the following, in no particular order:
I enjoy working with foundations, schools, research groups, and businesses to help them measure, understand, and affect individuals' behavior. Contact me at paunesku at gmail dot com if you'd like my help.
I made DataPlay.js to manipulate data with JavaScript. It was motivated by my desire to distribute the processing load associated with generating reports for educators and researchers who work with PERTS. We considered crunching the data in PHP or MySQL or by calling R from Apache, but there was an obvious appeal to doing the processing client-side. It turned out to be pretty easy to make a little JS library that makes data lookup and aggregation a snap. Take a look at the Github repository to download it or for examples.
I love R and the R community. It seems like every time I have an obscure question about data analysis or graphing, several good Samaritans have already posted their answers on Stack Overflow, Cross Validated, or other totally sweet R pages and blogs. When I can, I try to give back. In that spirit, I'm including some functions below that may prove useful to others.
My favorite package for graphing in R is ggplot2. Hats off to Hadley Wickham, its creator. ggplot2 provides lots of useful options that make it easy (once you know what you're doing) to make highly informative graphs. However, it's not always entirely intuitive. I often find myself looking at my notes or the documentation to remind myself how to change plot parameters. In this section, I provide a few basic use cases in the hopes that it will provide useful to others (and to my future self).
ggplot2 underwent some major revisions in version 0.9.2.1. Those broke a lot of my old ggplot2 code, but I can't complain because it made it easier to customize plots. The examples below assume you're using version 0.9.2.1 or greater.
To do some graphing, we'll need data to graph. Let's simulate some data from a hypothetical study. In this study, 2/3 of students are assigned to a treatment designed to boost math scores, and 1/3 are assigned to a control condition. Their math scores are collected before and after treatment. We want to visualize the effects of the treatment on follow-up scores.
# generate a data frame "d" with the data to graph
set.seed( 1 ) # set the random seed
n <- 200 # number of observations
es <- .7 # effect size
# oversample the treatment condition
conditions <- c( rep("treatment",2) , "control" )
assigned_condition <- sample( conditions , n , replace=TRUE )
treated <- assigned_condition == "treatment"
prescore <- rnorm( n )
postscore <- prescore + rnorm( n )
# create an interaction between treatment and prescore
# so that treatment is more effective for initial low-performers
interaction <- es - prescore[treated]
postscore[treated] <- interaction + postscore[treated]
d <- data.frame( prescore = prescore,
postscore = postscore,
condition = assigned_condition )
I love ggplot2, but I'm not crazy about some of the aesthetic defaults. Here we create a theme called horzi_theme that makes the background white and sets the background to have thin gray horizontal but not vertical lines. We'll use horzi_theme in each of the subsequent graphs.
# horzi theme makes the background white with thin, horizontal lines library( grid ) # for units horzi_theme <- theme( # remove the gray background panel.background = element_blank() , # make the major gridlines light gray and thin panel.grid.major.y = element_line( size=.1, colour="#666666" ) , # suppress the vertical grid lines panel.grid.major.x = element_blank() , # suppress the minor grid lines panel.grid.minor = element_blank() , # add axes axis.line = element_line( size=.2 , colour="#666666" ), # adjust the axis ticks axis.ticks = element_line( size=.2 , colour="#666666" ), # move the y-axis over to the left axis.title.y = element_text( angle=90, vjust=-.1, hjust=.5 ), # increase margin moved-over y-axis label fits plot.margin = unit( c(.5,.25,.25,.5) , "in") )
It's always smart to look at distributions first. Let's compare the two conditions' distributions with: 1) histograms in separate facets and 2) overlapping density curves.
Give each condition its own facet and histogram. Note that it's hard to compare the distributions by eye because there are more students in the treatment condition. That's lame, and it's where density plots come in.
ggplot( d , aes( postscore, fill=condition ) ) +
# binwidth=1 groups responses along bins of 1 x-axis unit
geom_histogram( binwidth=1 ) +
# override the default colors for each condition
scale_fill_manual( values=c("control"="red","treatment"="blue" ) ) +
# include horzi_theme to change the default background etc.
horzi_theme +
# create separate facets for condition
facet_grid( . ~ condition )
Density curves look cooler. They're also more useful when distributions are unbalanced in count: They show the percent of the given distribution instead of the absolute count.
# the scales library enables % formatting (used for y-axis below) library( scales ) ggplot( d , aes( postscore, fill=condition ) ) + # we need to make it transparent so we can see the overlap geom_density( alpha = .3 , size=0 ) + scale_y_continuous( labels=percent ) + # set the curve colors scale_fill_manual( values=c( "red","blue" ) ) + horzi_theme + # make the legend a little prettier guides( fill = guide_legend( title = "Condition" , # change font-face to remove bold title.theme = element_text( face="plain", angle=0 ) , # remove the black border around keys override.aes = list( colour="white" ) ) )
What about that venerable classic (sarcasm), the bar graph?
# make a simple graph showing postscore by condition ggplot( d , aes( condition, postscore, fill=condition ) ) + geom_bar( stat="summary", fun.y="mean", position="dodge" ) + # manually set colors and corresponding labels; no need for a legend scale_fill_manual( guide = "none", values = c( "control"="red", "treatment"="blue" ) , breaks = c( "control", "treatment" ) , labels = c( "control group", "treatment group" ) ) + # label the graph and axes ggtitle( "A totally lame bar graph" ) + # add a y-axis title scale_y_continuous( "Post-Study Math Score" ) + horzi_theme
Now let's make a bar graph with error bars. To make error bars we'll pre-compute a new data frame "g" with the condition means and error bar limits.
# compute the means and save to g
g <- aggregate( postscore ~ condition, data = d , mean )
# append the standard deviation to g
gsd <- aggregate( postscore ~ condition, data = d , sd )
g <- merge( g, gsd , by="condition", suffixes=c("",".sd") )
# append the n to g
glen<- aggregate( postscore ~ condition, data = d , length )
g <- merge( g, glen , by="condition", suffixes=c("",".len") )
# calculate the standard error of the mean
g$se <- g$postscore.sd/(g$postscore.len)^.5
# compute the error bar limits
g$upper <- g$postscore + g$se
g$lower <- g$postscore - g$se
And now for the actual graph code...
# now make the bar graph with error bars ggplot( g , aes( condition, postscore, fill=condition) ) + # stat="identity" because the value is already in g (no averaging needed) geom_bar( stat="identity", position="dodge" ) + # manually set colors and corresponding labels; no need for a legend scale_fill_manual( guide = "none", values = c( "control"="red", "treatment"="blue" ) , breaks = c( "control", "treatment" ) , labels = c( "control group", "treatment group" ) ) + # add the error bars geom_errorbar( aes( ymax=upper, ymin=lower ) , width =.25, position="dodge", color ="#666666" ) + # label the graph ggtitle( "A slightly more informative graph\n(because of the error bars)" ) + # adjust the title font and size, just for fun theme( title = element_text( family="Verdana", size=10 ) ) + # add a y-axis title scale_y_continuous( "Post Scores" ) + horzi_theme
Neither of the bar graphs gave us any indication that the treatment was more effective for intial low-performers. So let's make a line graph to visualize this moderation effect.
# make a line + scatter graph showing prescore and postscore by condition ggplot( d , aes( prescore, postscore, color=condition, linetype=condition) ) + # add points (scatterplot) geom_point( position="jitter" , # jitter the points to prevent overlap alpha=.3 # reduce opacity ) + # create a trendline for each condition geom_smooth( method="lm", # method="lm", use a linear model stat::lm() se=TRUE , # se=TRUE, show 95% confidence bands fullrange=TRUE # fullrange=TRUE, extend trendlines to margin ) + # manually set colors and corresponding labels scale_colour_manual( values = c( "control"="red", "treatment"="blue" ) , # define the groups breaks = c( "control", "treatment" ) , # rename the groups on the legend labels = c( "control group", "treatment group" ) , # change the legend parameters guide = guide_legend( title="Treatment Condition", title.theme = element_text( size = 13, face = "plain", color = "#333333", angle = 0 ) , label.theme = element_text( size = 12, face = "plain", color = "#333333", angle = 0 ) , ) ) + # manually set linetypes and corresponding labels scale_linetype_manual( values = c( "control"="solid", "treatment"="dashed" ) , guide="none" # suppress linetype legend ) + # adjust the zoom of the graph (not useful in this case) coord_cartesian( x=c(-2,2) , y=c(-2,2) ) + # label the graph and axes ggtitle( "A much more informative graph" ) + xlab( "Pre-Score" ) + # just for fun, we'll change the y-axis breaks scale_y_continuous( "Post-Score" , breaks=c(-1:1) ) + # change the look of the graph theme( # make the title 16 point Verdana and axis titles 12 point Verdana title = element_text( family="Verdana", size=12 ) , axis.title.x = element_text( size=12 ) , axis.title.y = element_text( size=12 ) ) + horzi_theme
Stratified randomization keeps the size of randomly-assigned experimental conditions balanced across specified sample characteristics, e.g., race, gender, previous performance. The two functions below make it easy to conduct stratified randomization in R.
When conducting experiments, it is important to ensure that the sample of individuals in each experimental condition is similar. For example, let's say that proportionally more high achievers than low achievers were randomly assigned to the treatment condition. If the treatment condition experienced greater improvement along the outcome of interest, it would be difficult to ascertain whether the improvement was due to the treatment itself or due to high achievers improving more quickly than low achievers in general. Although there are post-hoc ways to statistically control for such imbalances, failures of randomization can complicate analyses and damage researchers' ability to rigorously draw causal inferences about the effects of a given treatment.
This code has two (interdependent) functions and some example code for testing them.
# balanced_randomization
# Inputs:
# ids vector of unquie ids to randomize to one of the groups
# groups groups to randomize ids to
# Returns:
# data frame with columns id and randomly assigned group
# Description:
# ids are randomly assigned to groups. Groups are balanced, i.e.,
# each group has the same # of ids. If the # of ids is not divisible
# by the # of groups, the remainder is assigned randomly.
balanced_randomization <- function( ids, groups = c() ) {
# make sure ids are not duplicated
if( TRUE %in% duplicated(ids) ){ stop("Duplicate IDs not allowed!") }
# ids should be characters because factors are will match falsely
ids <- as.character( ids )
# generate a data frame that will contain the group assignments
sampling_df <- data.frame( id=ids,
group=rep( NA, length(ids) ) ,
stringsAsFactors=FALSE
)
# keep track of ids that have already been randomized
already_sampled <- c()
# how many ids should be assigned to each group?
size_per_group <- floor( length(ids) / length(groups) )
# randomize ids to each group
for(i in 1:length(groups) ) {
# sample the ids for this group, excluding those already assigned
ids_in_group <- sample(
ids[ ! ids %in% already_sampled ] ,
size_per_group ,
replace=FALSE
)
# update the list of already assigned ids
already_sampled <- c( already_sampled, ids_in_group )
# set the group assignment
sampling_df$group[ sampling_df$id %in% ids_in_group ] <- groups[i]
}
# if the number of ids is not evenly divisble by the number of groups,
# randomly fill out the remaining, unfilled entries
remainder <- length(ids) %% length(groups)
if(remainder > 0){
sampling_df$group[ is.na(sampling_df$group) ] <-
sample( groups , remainder, replace=FALSE)
}
return( sampling_df )
}
# stratified_randomization
# (requires the package "plyr" and the "balanced_randomization" function)
# Inputs:
# DF data frame containing ids to randomly assign and the
# corresponding stratification characteristics for each id
# id a string specifying which column of DF is the id to randomize
# strata a vector of strings specifying how to stratify the data
# groups groups to randomize ids to
# Returns:
# data frame with a "group" column appended with the randomly assigned group
# Description:
# For each stratum (unique combination of strata), ids are randomly assigned
# to one of the passed in groups in a balanced way (using balanced_randomization).
stratified_randomization <- function( DF, id, strata, groups ){
# we need plyr for its handy dlply function. Thanks, Hadley!
if( ! require( plyr ) ){ stop( "You need to install plyr!" ) }
# set the strata (features you want to stratify on)
strata <- dlply( DF , strata )
# perform balanced randomization within each stratum
for( stratum in strata ){
# the first time, create the data frame with assignment
if( ! exists("strat_df") ){
strat_df <- balanced_randomization( stratum[ , id], groups)
}
# subsequent times append to data frame with assignments
else{
strat_df <- rbind( balanced_randomization( stratum[ , id], groups) ,
strat_df )
}
}
return( merge( DF, strat_df, by=id ) )
}
# BALANCED RANDOMIZATION EXAMPLES
# Sample from 3 groups equally
ids_to_sample <- c(1:100)
groups <- c("Group 1","Group 2", "Group 3")
sampled_df <- balanced_randomization( ids = ids_to_sample, groups = groups)
table(sampled_df$group)
# Sample from 3 groups where you weight Group 1 2x
ids_to_sample <- c(1:100)
groups <- c( rep( "Group 1", 2 ),"Group 2", "Group 3")
sampled_df <- balanced_randomization( ids = ids_to_sample, groups = groups)
table(sampled_df $group)
# STRATIFIED RANDOMIZATION EXAMPLE
# generate a fake dataset "d"
ids_to_sample <- c(1:100)
gender <- sample( c("M","F") , 100, replace=TRUE )
race <- sample( c("W","B","L") , 100, replace=TRUE )
d <- data.frame( id=ids_to_sample , gender=gender, race=race )
# define the groups to randomize to
groups <- c( "Group 1","Group 2", "Group 3")
# specify the data frame, the id, the stratification features, and the groups to randomize to
stratified <- stratified_randomization( d , id="id", strata=c("race","gender"), groups=groups )
# confirm the conditions are balanced
aggregate( id ~ group + gender + race, stratified, length )
Found something you like or something you don't like? I'd love to get your feedback.