Lab 2: Creating High Quality Graphics Lab

Work through this lab by running all the R code to your computer and making sure that you understand the input and the output.

We encourage you to work through this lab with a partner.

Once you get to the end of this lab, go to https://web.stanford.edu/class/bios221/cgi-bin/index.cgi/ to answer some questions that require you to use ggplot2. You may NOT work with other students to answer the questions on OHMS. You will need a Stanford ID to log in to OHMS.

Back to Labs main page.

Getting Started

The goal of this lab is to become familiar with the grammar of graphics philosophy implemented in the R package ggplot2.

You will want to refer to the documentation for this package which is available at http://docs.ggplot2.org/. A complete guide to the Grammar of Graphics philosophy is available through Springer Link (which you have free legal access to with your Stanford ID) http://link.springer.com/book/10.1007/978-0-387-98141-3/page/1

In order to learn how to use this graphing package, we will use a dataset of experimental data called parathyroidSE available through Bioconductor: http://www.bioconductor.org/packages/release/data/experiment/html/parathyroidSE.html You need to make sure the packages parathyroidSE and ggplot2 are installed on your computer, and load it into R as follows:

# I will use the command invisible() to surpress the output when 
# the output is unimportant for you to see
library("parathyroidSE")
library("Biostrings")
library("ggplot2")
library("nutshell")

If these are not installed, then you can install them by uncommenting the commands below.

#source("http://bioconductor.org/biocLite.R")
#biocLite("parathyroid")
#install.packages("ggplot2")
#install.packages("nutshell")

We will be going through some the topics here: http://docs.ggplot2.org/ You should keep that website open as a reference throughout this tutorial.

To learn about the graphics of grammar, we will use a data set about births in the USA in 2006. You can read more about this in the “R in a Nutshell, 2nd edition” book which is freely available as a .pdf online. This is convenient since you can follow the same methods in that book (except translating into the grammar of graphics to make plots).

First, load this data (make sure you already did library(nutshell) first).

data(births2006.smpl)
# look at this data set by uncommenting the line below
#help(births2006.smpl)
births.big <- births2006.smpl
# add a new variable to include whether it is a weekend or not
births.big$WEEKEND <- c("Weekday","Weekend")[(
  births2006.smpl$DOB_WK %in% c(1, 7)) + 1]
# courser health score based on APGAR 5 score
births.big$HEALTH <- c("CritLow","Low","Normal")[(
  findInterval(births2006.smpl$APGAR5, c(3.5, 6.5))) + 1]
# work with a smaller sample (just first 10000 instead of 427323)
# because my laptop is old and I'm impatient
births.big$ESTGEST <- replace(births.big$ESTGEST, births.big$ESTGEST==99, NA)
births.small <- births.big[1:10000, ]

Plot Creation & Geoms in the Grammar of Graphics

The function ggplot() is used to construct a plot incrementally, using the + operator to add layers to the existing ggplot object. This is advantageous in that the code is explicit about which layers are added and the order in which they are added. The layers that we add are geometric objects, or “geoms”. You can think of the call ggplot as taking out a piece of paper and the call geom_something draws something on the piece of paper. For example, geom_bar draws a bar plot. The first argument in ggplot is the data that is going to be drawn. We also need to describe the aesthetics of the plot and we do this with the function aes(). The aesthetics of a plot are the specifics about what is going to be drawn and how. This can be included in the initialization ggplot(data=..., aes(...)) or in the geometric object geom(aes(...)).

For example, let’s look at the number of births on each day of the week in our sample data. First set up our plot with the data births.

ppp <- ggplot(births.small) # plot with a subset of the data (faster)

Note that this doesn’t actually plot anything (just like if you wrote junk <- 7 doesn’t output anything until you write junk. What happens here when you type ppp?

The number of births per day of the week can be shown easily in a barplot, so let’s use that. To create a geometric object layer for a barplot we use the function geom_bar(). In order for it to know what part of the data we want to actually plot, we need to give an aesthetic. We can do this by declaring the aesthetic for the x-axis is the day of the week of the birth. The column DOB_WK gives the day of the week that each birth happened as a numeric value 1 (Sunday) through 7 (Saturday). We can tell R that this is actually a factor variable by putting the variable name in the function factor(). Putting this all together, we get geom_bar(aes(x=factor(DOB_WK)). Finally, to draw this layer on top of the initialized graph, we will add the geometric object layer with the + operator.

ppp + geom_bar(aes(x=factor(DOB_WK)))

Doctors are able to delay birth with anti-contraction medications or labor represents called tocolytics. That might be one reason we see fewer births on day 1 of the week (Sunday) and day 7 (Saturday).

We can get further information from this geom if we add more aesthetics. For example, maybe we can fill each bar with different colors corresponding to what type of birth it was (“C-section”, “Vaginal”, or “Unknown”). We can do this by just including another aes in the geometric object. Start with the same initialization, but add a geometric object with the x-axis and also fill color defined.

ppp + geom_bar(aes(x=factor(DOB_WK), fill=DMETH_REC))

When we made that plot, we used a default value for the position argument of the barplot geometric object. We could have rewritten the above command as follows.

ppp + geom_bar(aes(x=factor(DOB_WK), fill=DMETH_REC), position="stack")

Another option is to use the position="dodge". Note that this is an argument to geom_bar and not to aes.

ppp + geom_bar(aes(x=factor(DOB_WK), fill=DMETH_REC), position="dodge")

Now it’s a bit clearer to see that about 1/2 as many C-sections on an average weekend as there are on an average weekday, whereas there are about 3/4 as many vaginal births on an average weekend as there are on an average weekday.

Facets in the Grammar of Graphics

Let’s continue looking at the birth data on a day to day basis. We might conjecture that older women are less likely to take a tocolytic since they are more likely to have had other complications already. One way we can do this is to “facet” the graph and display day of the week versus women’s age.

First, let’s make a histogram of all women’s ages to get an idea of what this overall distribution looks like.

ppp + geom_histogram(aes(x=MAGER), binwidth=1)

We use the argument binwidth=1 to say the width of the bins in this histogram.

Using the grammar of graphics, it is easy to facet this single graph to make multiple graphs along a dimension. In this case, we’re interested in breaking this up along the dimension of day of birth DOB_WK. We will add this facet with the command facet_grid or facet_wrap. In facet_grid, the argument we use is a formula with the rows (of the tabular display) on the LHS and the columns (of the tabular display) on the RHS. A formula is separated in R with the tilde character ~. A dot in the formula is used to indicate there should be no faceting on this dimension (either row or column). The formula can also be provided as a string instead of a classical formula object. In facet_wrap, we have the same sort of argument, but we only include a RHS of the formula. We’ll use both of them in an example so you can see the difference.

Now let’s look at this faceting on that variable. Again, we will use the + operator. Here, we also see that we can save the geometric objects in the plot and just add facets at the end.

ppp1 <- ppp + geom_histogram(aes(x=MAGER, fill=DMETH_REC), binwidth=1)
ppp1 + facet_wrap( ~ DOB_WK)

ppp1 + facet_grid(DOB_WK ~ SEX)

What is the differences between facet_wrap and facet_grid?

Here is an interesting perspective of the data.

# exclude the Unknown delivery method data
ppp.big <- ggplot(subset(births.big, DMETH_REC %in% c("C-section","Vaginal"))) 
ppp.big + 
  geom_histogram(aes(x=MAGER, fill=factor(TBO_REC)), binwidth=1) +
  facet_grid(WEEKEND ~ DMETH_REC, scale="free_y", drop=TRUE) +
  geom_vline(xintercept=seq(15,45,by=5), alpha=0.2, color="white") +
  labs(title="Births in USA 2006", fill="Birth\nOrder")

Statistics in the Grammar of Graphics

It’s often useful to transform your data before plotting, and that’s what statistical transformations do.

Here we look at the histogram of mother’s ages as before, but we also add a density estimate to it.

ppp.m <- ggplot(births.small, aes(x=MAGER)) + 
  geom_histogram(aes(y=..density..), binwidth = 1, fill="grey", col="black")
ppp.m + stat_density(aes(x = MAGER), col="red", fill=NA)

Maybe we want to compare this to a normal distribution’s density.

ppp.m + stat_density(aes(x = MAGER), col="red", fill=NA) +
  stat_function(fun = dnorm, col = "blue", 
                arg=list(mean=mean(births.small$MAGER),
                         sd=sd(births.small$MAGER)))

You are able to add any function using the stat_function.

In this example, we will plot the birth weight in grams versus weight gain by mother (which seems to be in pounds). Unfortunately, we need to be a little more careful about this when there are NAs in the data.

ppp2 <- ggplot(births.small[
  !is.na(births.small$WTGAIN) & !is.na(births.small$DBWT), ], 
               aes(x=WTGAIN, y=DBWT)) +
  labs(x="Weight Gain by Mother", y="Birth Weight in Grams")
ppp2 + geom_point()

When we plot this data by itself, it is not clear what is going on – there are many points plotted on top of each other and it just looks messy.

One way to transform this data with a stat_ in ggplot2 is by binning the points as below.

ppp2 + stat_bin2d()

See that the color axis reports counts. That means we count the total number of observations to fall within a box. This plot seems to indicate that the joint distribution of weight gain by mother and birth weight in grams is unimodal.

Instead of just making counts though, we can compute functions within each of the 2d bins on another column in the data frame. For example, let’s look at the median number of weeks of gestation for each of the bins.

ppp2 + stat_summary2d(aes(z=ESTGEST), fun=median) + labs(title="median # weeks of gestation")

## Warning: Removed 10 rows containing missing values (stat_summary2d).

If we want to do a regression, we can include this automatically. By default, ggplot2 uses “locally weighted scatterplot smoothing” (loess) regression for data sets with < 1000 points and “generalized additive models” (GAM) for >= 1000 points.

ppp2 + geom_point() + stat_smooth(method=lm) # here we force a linear model

By changing the aesthetics of the fill color of the model, we can fit different models on subsets of the data and compare them. For example, let’s look at a smoothed fit (GAM) of birthwegith versus weight gain by mother separated out by baby boys and girls.

ppp2 + geom_point() + stat_smooth(aes(fill=SEX))

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

Scales in the Grammar of Graphics

Scales control the mapping between data and aesthetics. Going back to the 2d distribution of birthweight bersus weight gain by mothers, it is difficult to see what is going on except that there is a dense region and a less dense region. If we take the square root of the number of births per region, then we can see that there is a smooth transition between the high density area and the low density area.

ppp2 + stat_bin2d() + scale_fill_gradient(trans="sqrt")

See what happens when you change the trans to log10.

In the same plot as above, we might want to change the scale of the vertical axis so we get better separation of the mean weeks of gestation. According to Wikipedia’s page on “Fetal Viability”, there is a 50% chance of viability at 24 weeks.

ppp2 + stat_summary2d(aes(z=ESTGEST), fun=mean) + 
  scale_y_log10(limits=c(100,10000)) +
  scale_fill_gradient2(midpoint=24) +
  labs(title="mean # weeks of gestation")

## Warning: Removed 10 rows containing missing values (stat_summary2d).

Look at only the quadruplets in the full dataset. We want to include lots of variables such as number of prenatal visits UPREVIS, the mother’s age MAGER, the estimated weeks of gestation ESTGEST, the delivery method DMETH_REC, and the mother’s education level DMEDUC.

ppp3 <- ggplot(subset(births.big, DPLURAL=="4 Quadruplet"), 
               aes(x=UPREVIS, y=MAGER)) + 
  geom_point(aes(size=ESTGEST, shape=DMETH_REC, col=DMEDUC)) +
  stat_smooth(aes(x=UPREVIS, y=MAGER, fill=DMETH_REC), method="lm")
ppp3

## Warning: Removed 4 rows containing missing values (geom_point).

ppp3 + scale_size(range=c(3, 6)) + scale_color_brewer(palette="Set1") +
  scale_shape(solid=FALSE)

## Warning: Removed 4 rows containing missing values (geom_point).

While shape must be discrete, colors and sizes can both be scaled continuously.

Parathyroid Example

We will use the parathyroidGenesSE package in R. Load the data and read the experimental information and the abstract.

data("parathyroidGenesSE")
exptData(parathyroidGenesSE)$MIAME

## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Experiment data
##   Experimenter name: Felix Haglund 
##   Laboratory: Science for Life Laboratory Stockholm 
##   Contact information: Mikael Huss 
##   Title: DPN and Tamoxifen treatments of parathyroid adenoma cells 
##   URL: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37211 
##   PMIDs: 23024189 
## 
##   Abstract: A 251 word abstract is available. Use 'abstract' method.

abstract(exptData(parathyroidGenesSE)$MIAME)

## [1] "Primary hyperparathyroidism (PHPT) is most frequently present in postmenopausal women. Although the involvement of estrogen has been suggested, current literature indicates that parathyroid tumors are estrogen receptor (ER) alpha negative. Objective: The aim of the study was to evaluate the expression of ERs and their putative function in parathyroid tumors. Design: A panel of 37 parathyroid tumors was analyzed for expression and promoter methylation of the ESR1 and ESR2 genes as well as expression of the ERalpha and ERbeta1/ERbeta2 proteins. Transcriptome changes in primary cultures of parathyroid adenoma cells after treatment with the selective ERbeta1 agonist diarylpropionitrile (DPN) and 4-hydroxytamoxifen were identified using next-generation RNA sequencing. Results: Immunohistochemistry revealed very low expression of ERalpha, whereas all informative tumors expressed ERbeta1 (n = 35) and ERbeta2 (n = 34). Decreased nuclear staining intensity and mosaic pattern of positive and negative nuclei of ERbeta1 were significantly associated with larger tumor size. Tumor ESR2 levels were significantly higher in female vs. male cases. In cultured cells, significantly increased numbers of genes with modified expression were detected after 48 h, compared to 24-h treatments with DPN or 4-hydroxytamoxifen, including the parathyroid-related genes CASR, VDR, JUN, CALR, and ORAI2. Bioinformatic analysis of transcriptome changes after DPN treatment revealed significant enrichment in gene sets coupled to ER activation, and a highly significant similarity to tumor cells undergoing apoptosis. Conclusions: Parathyroid tumors express ERbeta1 and ERbeta2. Transcriptional changes after ERbeta1 activation and correlation to clinical features point to a role of estrogen signaling in parathyroid function and disease."

Parathyroid adenoma (http://en.wikipedia.org/wiki/Parathyroid_adenoma) is a benign tumor of the parathyroid gland. The abstract tells us that some interesting genes to look at are the following. Estrogen related genes: ESR1, ESR2. Parathyroid related genes: CASR, VDR, JUN, CALR, ORAI2. The genes are labeled in Ensembl notation http://www.ensembl.org. I looked them up and put them into R for you.

gene.names <- matrix(ncol=3, byrow=T, data=c(
  "ESR1",  "ENSG00000091831", "estrogen",
  "ESR2",  "ENSG00000140009", "estrogen",
  "CASR",  "ENSG00000036828", "parathyroid",
  "VDR",   "ENSG00000111424", "parathyroid",
  "JUN",   "ENSG00000177606", "parathyroid",
  "CALR",  "ENSG00000179218", "parathyroid",
  "ORAI2", "ENSG00000160991", "parathyroid"))

Make the table of gene counts, add the patient info

countData <-assay( parathyroidGenesSE ) 
gene.counts <- t(countData[gene.names[,2],])
colnames(gene.counts) <- gene.names[,1]
dat <- cbind(data.frame(colData( parathyroidGenesSE)), data.frame(gene.counts))
# change dat$time to a numeric 24 and 48 instead of factor 24h and 48h
dat$time <- as.numeric(gsub("h","",(as.character(dat$time))))

Plot the estrogen related genes counts (ESR1 and ESR2) with plot aesthetics to separate patients, treatments, and times.

ppt <- ggplot(dat, aes(color=patient, shape=treatment, size=time)) +
  scale_size_area(breaks=c(24,48)) + scale_shape(solid=F)
ppt + geom_point(aes(x=ESR1, y=ESR2), position=position_jitter(w=.1, h=.1))

Here we jittered the points some so that there was not overplotting and we can see all the points.

Questions

Answer the questions on OHMS “Lab 1: Graphics”. Go to https://web.stanford.edu/class/bios221/cgi-bin/index.cgi/ to answer some questions that require you to use ggplot2. You may NOT work with other students to answer the questions on OHMS. You will need a Stanford ID to log in to OHMS.