Lab 8: Networks

Conceptual Goals

The purpose of this lab is to work through some examples to help you better understand the following:

Basics of Graphs and Networks
Network Visualization
Network Analysis
Getting STRING data and working with it in R

Definitions are in italics, R code snippets are in verbatim (monospace), questions for you to think about are in bold. Data can be downloaded from the data website http://www.stanford.edu/class/bios221/data/

You do not need to submit answers to the questions in this lab. The lab questions that you need to submit answers to will be posted on OHMS.

Graphs and Networks

A Graph \(G = (V, E)\) is a set of \(n\) vertices in \(V\) and a set of edges \(E\) which is a set of unordered pairs of vertices. We say \(i \sim j\) if vertex \(i\) is adjacent to vertex \(j\). Graphs are symmetric: \(i \sim j \Rightarrow j \sim i\).

An Adjacency Matrix \(\bf A\) is the matrix representation of \(E\). Iff \(i \sim j\) then the \((i,j)\) entry of \({\bf A}\) is 1. Iff \(i \not\sim j\) then the \((i,j)\) entry of \({\bf A}\) is 0. So \({\bf A} \in \{0,1\}^{n \times n}\). Since graphs are symmetric, the adjacency matrix is symmetric \({\bf A} = {\bf A}^T\).

A Network is a weighted, directed graph. Networks have adjacency matrices \({\bf A} \in R_+^{n \times n}\). Networks are not necessarily symmetric. Graphs are networks, networks are not necessarily graphs.

See the WikipediA article on [Graph (mathematics)](http://en.wikipedia.org/wiki/Graph_(mathematics)) for further clarifications. Browse through the Wikipedia article on Glossary of graph theory for many important terms. Here are some useful graph theory terms: graph, vertex, edge, loop, graph complement, subgraph, walk, path, cycle, acyclic graph, bipartite, diameter, girth, tree, leaf, forest, spanning tree, clique, clique number, degree, in-degree, out-degree, neighborhood, connected, disconnected, connected component, weighted graph and directed graph (i.e., “network”), out neighborhood, in neighborhood.

Finite State Markov Chains

Any finite state space Markov chain can be represented as a network. Any network in which every vertex has an out-degree of 1 describes a Markov chain. See chapter 1 of Markov Chains and Mixing Times or the Wikipedia article on Markov chain for some connections between Markov chains and networks.

Network Visualization

Graphs and networks have a structure which makes them natural but sometimes difficult to represent as an image. While numbers can be represented on a number line (and plotted against each other), we cannot do this with graphs and networks.

There are two common ways to represent graphs.

Draw (or plot) a graph by plotting vertices as points in two dimensions. Connect two vertices with a line segment if there is an edge between the vertices.
This is how graphs are usually visualized. This is useful when edges represent similarity between vertices because points that are similar will appear closer to each other in the figure.
Make a heatmap image of the adjacency matrix. This is useful for larger networks when the vertices can be organized (clustered) into several groups. This is useful when the edges represent interactions between the vertices because it is easy to see how groups of vertices relate to other groups. This visualization is only as good as the way you order the vertices into clusters.

We will first go over a simple example and then make some remarks about these figures.

Very Small Examples

Suppose we have a 5 state Markov chain with states \(a, b, c, d, e\). From state \(a\), we can get to states \(a, b, e\). From state \(b\), we can get to states \(a, d\). From state \(c\), we can get to states \(c, e\). From state \(d\), we can get to states \(c, d, e\). From state \(e\), we can get to state \(a\). Suppose we transition from the current state to the next with equal probability over all possible edges (i.e., from \(a\), we can get to states \(a, b, e\) each with probability \(1/3\)).

First we’ll construct the matrix of potential connections between the states and call that \(A\). Then we’ll divide by the sum in each row to get a transition matrix \(P\).

v1 <- letters[1:5] # vertex names: a, b, c, d, e
#            a b c d e
A1 <- rbind(c(1,1,0,0,1),
            c(1,0,0,1,0),
            c(0,0,1,0,1),
            c(0,0,1,1,1),
            c(1,0,0,0,0))
dimnames(A1) <- list(v1,v1) # assign the names to A
A1

##   a b c d e
## a 1 1 0 0 1
## b 1 0 0 1 0
## c 0 0 1 0 1
## d 0 0 1 1 1
## e 1 0 0 0 0

P1 <- diag(1/apply(A1, MARGIN=1, FUN=sum)) %*% A1
P1

##           a      b      c      d      e
## [1,] 0.3333 0.3333 0.0000 0.0000 0.3333
## [2,] 0.5000 0.0000 0.0000 0.5000 0.0000
## [3,] 0.0000 0.0000 0.5000 0.0000 0.5000
## [4,] 0.0000 0.0000 0.3333 0.3333 0.3333
## [5,] 1.0000 0.0000 0.0000 0.0000 0.0000

We can draw this in R using statnet like this.

require(statnet) # load the package statnet to visualize networks in R

A1net <- network(A1, directed=T) # make A into a network object
A1net

##  Network attributes:
##   vertices = 5 
##   directed = TRUE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   bipartite = FALSE 
##   total edges= 8 
##     missing edges= 0 
##     non-missing edges= 8 
## 
##  Vertex attribute names: 
##     vertex.names 
## 
## No edge attributes

# plotting a network object uses a random number generator,
# so we will set a seed to make sure it looks the same each time
set.seed(1)
plot(A1net, label=v1) # plot the network as a graph

plot of chunk unnamed-chunk-6

There is no good software for plotting a network which takes directedness and edge weights into account. There is good software for plotting graphs. Austen recommends using the package statnet (which combines several different other packages) to draw graphs. You can read about that here http://csde.washington.edu/statnet/.

When plotting graphs, the goal is to arrange vertices into a two dimensional space so that adjacent vertices are near each other, and non-adjacent vertices are not near each other. This is usually done with Force-directed graph drawing algorithms. You can think of balls in place of the vertices and connecting two balls with a short spring if there is an edge between them and a long spring if there is not an edge between them. If you jiggle this mess of balls and springs around, you might get a configuration where all the springs are being stretched or compressed with approximately equal force. This is similar to what plot.network(...) does in the R package statnet. This does not work well for \(\geq 500\) or so vertices. Since this is a randomized algorithm, to make this reproducible, you need to set a seed in the pseudo-random number generator using set.seed(...).

Visualizing the Adjacency Matrix

heatmap(A1, col=grey(c(0.9,0.1)), symm=TRUE)

plot of chunk unnamed-chunk-7

This is useful to see structural patterns which we might not see when we draw a graph. We divide a box into an \(n \times n\) grid and fill in the \(i,j\) box with a color corresponding to the \(i,j\) entry in the matrix. The argument col=grey(c(0.9, 0.1)) tells us that the 0s get colored light grey and the 1s get colored dark grey. Unfortunately, ggplot does not have a good method to plot heatmaps, so if you want to edit them, you need to use the base graphics in R.

For this visualization to be useful at all, you need to be able to order the vertices in some meaningful way. The function heatmap automatically reorders the vertices to cluster them so that rows which are similar are near each other and columns that are similar are near each other (but forcing the matrix to be symmetric since it is an adjacency matrix). The symm=TRUE means that the order of the rows and the order of the columns must be the same in A1, not that the matrix itself is symmetric. Later we will see how we might want to force a particular ordering that might be more informative.

In microarray data, you often see a heat maps where the rows and columns are ordered according to a hierarchical tree clustering order (as on the Wikipedia page). We will use that same idea in producing adjacency heatmap images in the next section.

STRING Data

Read the Wikipedia article about STRING. The STRING database is http://www.string-db.org. Read the help section “Getting Started” here http://string-db.org/help/index.jsp?topic=/org.string-db.docs/howto.html. Alternatively, you can get there by clicking on Help/Info at the top and navigating to that section.

The protein Cyclin B1 is encoded by the CCNB1 gene. You can read about it on wikipedia here: http://en.wikipedia.org/wiki/Cyclin_B1.

Go to http://www.string-db.org.
Enter CCNB1 as the protein name and Homo sapiens as the organism. Click “Continue!”
Select the option with protein CCNB1 (the top one).
Scroll down to “info and Parameters …”

4a. For Active Prediction Methods – unselect everything except “Co-Expression”

4b. For required confidence – select “highest confidence (0.900)

4c. For interactors shown – select “no more than 50 interactors”

4d. For additional (white) nodes – select “100” (these are nodes two steps away from CCNB1)

4e. Click “Update Parameters”. You should get something that looks like the image below.

Click “save” under the picture (showing a diskette).
This will open up a new window so you can choose which format to save the data.
Scroll down to the “Text Summary (TXT - simple tab delimited flatfile)” file and save that document as ccnb1datsmall.txt.
Follow the example in R code below to open the file in R and then make an image using statnet.

This (above) is the network of the CCNB1 2 step neighborhood with co-expression levels \(\geq\) 0.900.

The data on this website was collected on April 11, 2012 and might be different than what is now there. To reproduce the output below, download the file ccnb1datsmall.txt from the course data website.

##########
### read in the STRING data from the txt file you saved
##########

dat <- read.table("ccnb1datsmall.txt", header=T, comment.char = "")

v <- levels(unlist(dat[,1:2]))        # vertex names
n <- length(v)                        # number of vertices
e <- matrix(match(as.character(unlist(dat[,1:2])), v),ncol=2) # edge list
w <- dat$coexpression                 # edge weights
# M is our co-expression network adjacency matrix.
# Since the STRING data only says if proteins i and j are co-expressed and doesn't
# distringuish between i,j and j,i we want to make M symmetric (undirected) by
# considering the weight on i,j is the same as from j,i.
M <- matrix(0, n, n)                  # set up a co-expression matrix
M[e] <- w                             # fill it in with edge weights
M <- M + t(M)                         # make this symmetric
dimnames(M) <- list(v, v)             # label the vertices
# A is our co-expression graph adjacency matrix:
# We let A_ij = 1 if there if if are coexpressed
A <- 1*(M > 0)


##########
### first plot the graph (being lazy and using default plotting parameters)
##########

# make a network object (this uses statnet)
net <- network(e, directed=FALSE)
# we could aslo do `net <- network(A, directed=FALSE)`
set.seed(1)          # make the plot look the same as mine
par(mar=rep(0,4))    # make plot margins 0
plot(net, label=v)   # plot the network and label the vertices

plot of chunk unnamed-chunk-8

Again, we’ll be lazy and just use defaults in making a heatmap except for changing the colors.

# make a heatmap image with colors: white below 0.9,
# grey from 0.9 to 1 with darker closer to 1
breaks <- c(0, seq(0.9, 1, length=11)) # breaks in the color bins
cols <- grey(1-c(0,seq(0.5,1,length=10))) # colors

# color the vertex for CCNB1 blue, its neighbors red, and the others white
ccnb1ind <- which(v == "CCNB1") # index for ccnb1
vcols <- rep("white",n)
vcols[ccnb1ind] <- "blue"
vcols[which(M[,ccnb1ind]>0 | M[ccnb1ind,])] <- "red"

# now actually make the heat map
par(mar=rep(0,4)) # make plot margins 0
heatmap(M, symm=TRUE, ColSideColors=vcols, RowSideColors=vcols,
        col=cols, breaks=breaks,  frame=T)
legend("topleft", c("Neighbors(CCNB1)","CCNB1"), fill=c("red","blue"),  
       bty="n", inset=0, xpd=T,  border=F)

plot of chunk unnamed-chunk-9 CCNB1 network – 2 step neighborhood with co-expression levels \(\geq\) 0.900, generated from R (darker is closer to 1, we ignore values < 0.9).

This gives a visualization of the strongest interactions in the two step neighborhood of CCNB1. Both the plotted graph and the heatmap image show the same data: there seems to be a cluster of proteins which are all similar to CCNB1 and there is a cluster of other proteins. Many of the proteins in the CCNB1 cluster are coexpressed at the same time as each other. Why might this be the case? Conversely, proteins which are coexpressed with a protein that is coexpressed with CCNB1 (two steps away) do not tend to be coexpressed with each other. Is it easier for you to see this in one of the figures (the plot or the heatmap) than the other?

Associative Clustering versus Structural Clustering

This brings us to the point that there are 2 types of ways we can think about “clustering” vertices. The first is that there are lots of edges between points in a cluster and not many going outside of that cluster. This is called associative clustering because the clustering describes groups of vertices that associate with each other in terms of edges.

In this data though, we see that there is another type of clustering also – this is evident from the dendrogram in the heatmap image. There is a cluster where there are almost no edges within the cluster, but all the vertices in that cluster have the same type of structure. In particular, the cluster on the left (top) of the heatmap image have very few edges within the cluster, have and have a few edges to vertices in the other cluster. When we can divide a heatmap image into well defined blocks like this, then we say that there is structural clustering. Unfortunately for biologists, structural clustering has not been studied as in depth as associative clustering in biology. Much of the research on structural clustering comes from sociology where this is often analyzed using “stochastic blockmodel analysis”. To show associative clustering, do you think that we can say whether graph plots or heatmap images usually work better (or does it depend too much on the particular graph)? What about for structural clustering?

A Bigger Data Set

Remember that it is almost always a bad idea to throw information away. When we thresholded the dataset above at co-expression levels of 0.9 or higher, we essentially threw away all the information for other coexpression levels.

In the STRING database website, if you change the confidence level to “low confidence (0.150)” then we get much more information. When we do this, we should change the number of interactors to be very high (e.g., 1000) and additional white nodes to be very high (e.g., 1000). Don’t actually do this because it takes a long time to gather this data. If you were to do this in April 2012, you’d get 603 vertices and 29,523 edges. This is too big to easily be visualized as a plot, but this richer dataset is what we’ll use for network analytics. Also, it is much better to start with too much information and look at subsets of that information than it is to limit your initial data. That full dataset can be downloaded here: ccnb1dat.txt.

You can use the same code as above to read it in, but do not use the plot() function unless you want to wait for it for a long time.

This is the dataset we’ll use for some of the homework questions.

E Coli Gene Expression Data Example

We will be looking at gene expression data in E. Coli. The example we are using is based on the work from “Comparing Statistical Methods for Constructing Large Scale Gene Networks” by Allen, et al. and can be found here: http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0029348

DON’T DO THIS NOW, but, here’s how we get the data.

go to: http://m3d.mssm.edu/
click “Normalized data”
download the most recent one (E_coli_v4_Build_6.tar.gz) from 3-Sep-2009
extract the file avg_E_coli_v4_Build_6_exps466probes4297.tab
extract the file E_coli_v4_Build_6.probe_set_descriptions

Download the following files from our data website:

avg_E_coli_v4_Build_6_exps466probes4297.tab
E_coli_v4_Build_6.probe_set_descriptions

The .tab file is about 30Mb, so download it once and save it somewhere. You can also download the .zip version of the file.

########
### preamble
########

require(Matrix) # convenient to deal with sparse matrices, has sparseMatrix() in it
  
########
### read in, clean, and construct data
########

# read in observed data
data.ecoli.orig <- read.table("avg_E_coli_v4_Build_6_exps466probes4297.tab", 
                        row.names=1, header=T, sep="\t")
data.ecoli <- t(data.ecoli.orig)
  
# read in probe set names so we can change use standard "friendly" names
data.names.tab <- read.table("E_coli_v4_Build_6.probe_set_descriptions",
                             row.names=1, header=T, sep="\t")

# check to see we're changing names correctly
cbind(colnames(data.ecoli)[1:10], rownames(data.names.tab)[
  match(rownames(data.ecoli.orig), rownames(data.names.tab))][1:10])

##       [,1]            [,2]           
##  [1,] "aaaD_b4634_3"  "aaaD_b4634_3" 
##  [2,] "aaeA_b3241_14" "aaeA_b3241_14"
##  [3,] "aaeB_b3240_15" "aaeB_b3240_15"
##  [4,] "aaeR_b3243_15" "aaeR_b3243_15"
##  [5,] "aaeX_b3242_12" "aaeX_b3242_12"
##  [6,] "aas_b2836_14"  "aas_b2836_14" 
##  [7,] "aat_b0885_14"  "aat_b0885_14" 
##  [8,] "abgA_b1338_14" "abgA_b1338_14"
##  [9,] "abgB_b1337_15" "abgB_b1337_15"
## [10,] "abgR_b1339_15" "abgR_b1339_15"

# change names to friendly_name
colnames(data.ecoli) <- tolower(data.names.tab[
  match(rownames(data.ecoli.orig), rownames(data.names.tab)),"friendly_name"])

# this takes a while to calculate: it's a very big correlation matrix
cor.ecoli <- cor(data.ecoli)
dim(cor.ecoli)

## [1] 4297 4297

Network Analysis

There are may different network analytics that people are interested in. We will focus on clustering.

We will do “spectral clustering” by doing k-means clustering on the first several eigenvectors of the symmetric normalized graph Laplacian. We’ll do this on the correlation matrix of the E. Coli gene expression data from above. This tends to do well for associative clustering (since we’re looking at eigenvectors).

First, we construct a graph adjacency matrix from the correlation matrix by thresholding the values: everything about 0.85 is counted as an edge, and everything below that is counted as a non-edge. This value of 0.85 was chosen so that there would be a about 200 vertices in the plot and does not have any specific biological significance. If you do this on your own data, you should probably have a better reason for picking a cutoff value. We treat the thresholded data as an adjacency matrix and look at the connected components of the corresponding graph.

# Make a similarity matrix by thresholding the correlation matrix
Sfull <- Matrix(1*(cor.ecoli > 0.85))

# connected component distribution (have to use matrix instead of Matrix here)
Scd <- component.dist(as.matrix(Sfull))
table(Scd$csize) # 3022 isolated vertices and 1 component with 219 vertices

## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   18 
## 3022  176   50   21   12    9    2    3    4    2    2    1    2    1    1 
##   29   57   84  219 
##    1    1    1    1

The function component.dist will give us information on the connected components in the graph of this full adjacency matrix. We’re only going to be concerned with the largest connected component in this lab (although in practice, you should look at all the connected components which seem larger than you might expect from a Poisson distribution – in this case, the components of size 29, 57, 84, and 219). We’ll select all of the vertices in the largest connected component (LCC).

# which vertices are in the largest connected component
lcc.ind <- which(Scd$membership == which.max(Scd$csize))
  
# We'll call the LCC "S" and just work with this component
S <- Sfull[lcc.ind,lcc.ind]
n <- dim(S)[1]
n

## [1] 219

Now we’ll look at the eigendecomposition of the symmetric normalized graph Laplacian. For those of you familiar with topology, this is the discrete equivalent of a Laplacian. When we construct this, we make an identity matrix minus a scaled adjacency matrix – since we’re subtracting the interesting part of the matrix, we’re concerned with the smallest eigenvectors (except the very smallest which will has eigenvalue 0).

# degree distribution
d <- as.vector(S %*% rep(1,n)) #symmetric, so we can pre or post multiply
  
# Laplacians (symmetric normalized)
Lsym <- Diagonal(n) - (Diagonal(x=d^(-1/2)) %*% S %*% Diagonal(x=d^(-1/2)))
  
# eigen decomposition (takes a few seconds -- R doesn't have a sparse eigen)
eLsym <- eigen(Lsym)

# decide how many eigenvectors to use: look at the smallest ones (except the last)
plot(eLsym$values[n-1:25], xlab="i", ylab="(n-i)th eigenvalue", 
     main="smallest eigenvalues")
abline(v=seq(0,25,by=5),col=rgb(0,0,0,0.2))

plot of chunk unnamed-chunk-13 We want to find the jump in the eigenvalues. To me, it looks like there are jumps between the (n-1) and (n-2) eigenvalues, (n-3) and (n-4) eigenvalues, (n-8) and (n-9) eigenvalues. This means we want to consider either 1, 3, or 8 eigenvectors. Remember that we do not include the very smallest since it will have eigenvalue 0 by construction and the eigenvector will just be noise from floating point accuracy.

# Use 3 dimensions for the eigenvectors of the symmetric normalized Laplacian.
eLV <- eLsym$vectors[,n-1:3]

We want to do some sort of clustering so vertices that are similar to each other get clustered together. For this lab, we’ll use k-means clustering. We’ll keep track of the total within sum of squares (TWSS) for each number of clusters. We want this to be very small without including lots of clusters. The heuristic that we’ll use is to look for the elbow in the TWSS plotted versus total number of clusters.

Note that k-means is a randomized algorithm, and for any fixed k, we’re only interested in the result which has the smallest TWSS. Here, we’ll start with 100 randomized starts for each number of clusters. We’ll look at anywhere from 1 to 10 clusters.

# kmeans uses random numbers, so let's make this repeatable
set.seed(2)

# calculate total within sums of squares for k means clustering
kmWCSS <- sum(sweep(eLV,2,apply(eLV,2,mean))^2) # total within sum of squares
nclusts <- 10 # largest number of clusters we'll consider
nstart <- 100 # how many random starts we use
kmWCSS[2:nclusts] <- unlist(lapply(2:nclusts, function(i){
  sum(kmeans(eLV, i, nstart=nstart)$withinss)}))

# plot these TWSS versus number of cluters
plot(1:nclusts,kmWCSS)
abline(v=seq(0,nclusts,by=5),col=rgb(0,0,0,0.2))

plot of chunk unnamed-chunk-15

# There's an "elbow" at 4 clusters, so let's use 4 clusters.
my.nclusts <- 4
my.km <- kmeans(eLV, my.nclusts, nstart=nstart)
  
# draw this graph
Snet <- network(as.matrix(S), dir=F)
set.seed(1)
par(mar=rep(0,4)+0.1) # makes the margins smaller

# Save the coordinates, and for now just plot the edges.
coords <- plot(Snet, vertex.col=0, vertex.border=0, edge.col=rgb(0,0,0,0.7)) 
  
# to plot this quickly again, you can put in the coords from before
#plot(Snet, coord=coords, vertex.col=0, vertex.border=0, edge.col=rgb(0,0,0,0.7))
  
# add vertices with cluster colors and different shapes
my.cols <- rainbow(my.nclusts, alpha=0.8)
points(coords, col=my.cols[my.km$cluster], pch=my.km$cluster, lwd=2)

plot of chunk unnamed-chunk-15

What happens when we want to do hierarchical clustering instead of k-means clustering?

my.dend <- hclust(dist(eLV), method="ward.D")
plot(my.dend)

plot of chunk unnamed-chunk-16 We want to include as many clusters as we feel are actually well separated. This corresponds to drawing a horizontal line across here to cut the tree into clusters. It looks like we should not use more than 7 clusters and we should definitely use at least 4 clusters. Try changing the number of clusters from 4 to 7 to decide which number of clusters you like most.

cut.height <- 5 # try numbers from 4 to 7
my.hclusts <- cutree(my.dend, cut.height)
my.cols <- rainbow(cut.height, alpha=0.8)
plot(Snet, coord=coords, vertex.col=0, vertex.border=0, edge.col=rgb(0,0,0,0.7))
points(coords, col=my.cols[my.hclusts], pch=my.hclusts, lwd=2)

plot of chunk unnamed-chunk-17 You can also try changing method="ward.D" to something else. Hierarchical clustering has the very nice property that it is agglomerative (unlike k-means), so it preserves some interpretability of subsets, whereas k-means has the nice property that it minimizes the TWSS.

A great feature of doing hierarchical clustering is that now we have a dendrogram so we can make a heatmap of the adjacency matrix to correspond to the plot of the graph.

vcols <- my.cols[my.hclusts]
par(mar=rep(0,4)) # make plot margins 0
# cols = 0:1 is white for 0 and black for 1
heatmap(as.matrix(S), Rowv = as.dendrogram(my.dend), symm=TRUE, 
        ColSideColors=vcols, RowSideColors=vcols,  col=0:1, frame=T)

plot of chunk unnamed-chunk-18 Now it is easier to see that there are some structures within the red cluster.