Lab 7: Phyloseq

Importing phyloseq data

Load the phyloseq package:

require("phyloseq")
packageVersion("phyloseq")

## [1] '1.4.5'

require(ggplot2)

The custom functions that read external data files and return an instance of the phyloseq-class are called “importers”. Validity and coherency between data components are checked by the phyloseq-class constructor, phyloseq() which is invoked internally by the importers, and is also the suggested function for creating a phyloseq object from “manually” imported data. The component indices representing OTUs or samples are checked for intersecting indices, and trimmed/reordered such that all available (non-) component data describe exactly the same OTUs and samples, in the same order.

Currently available import functions

See ?import after phyloseq has been loaded (library("phyloseq")), to get an overview of available import functions and documentation links to their specific doc pages, or see below for examples using some of the more popular importers.

We'll show how to use the import_biom function.

Newer versions of QIIME produce a more-comprehensive and formally-defined JSON file format, called biom file format:

“The biom file format is designed to be a general-use format for representing counts of observations in one or more biological samples. BIOM is a recognized standard for the Earth Microbiome Project and is a Genomics Standards Consortium candidate project.”

http://biom-format.org/

The phyloseq package includes small examples of biom files with different levels and organization of data. The following shows how to import each of the four main types of biom files (in practice, you don't need to know which type your file is, only that it is a biom file). In addition, the import_biom function allows you to simultaneously import an associated phylogenetic tree file and reference sequence file (e.g. fasta).

First, define the file paths. In this case, this will be within the phyloseq package, so we use special features of the system.file command to get the paths. This should also work on your system if you have phyloseq installed, regardless of your Operating System.

rich_dense_biom  = system.file("extdata", "rich_dense_otu_table.biom",  package="phyloseq")
rich_sparse_biom = system.file("extdata", "rich_sparse_otu_table.biom", package="phyloseq")
min_dense_biom   = system.file("extdata", "min_dense_otu_table.biom",   package="phyloseq")
min_sparse_biom  = system.file("extdata", "min_sparse_otu_table.biom",  package="phyloseq")
treefilename = system.file("extdata", "biom-tree.phy",  package="phyloseq")
refseqfilename = system.file("extdata", "biom-refseq.fasta",  package="phyloseq")

Now that we've defined the file paths, let's use these as argument to the import_biom function. Note that the tree and reference sequence files are both suitable for any of the example biom files, which is why we only need one path for each. In practice, you will be specifying a path to a sequence or tree file that matches the rest of your data (include tree tip names and sequence headers)

import_biom(rich_dense_biom, treefilename, refseqfilename, parseFunction=parse_taxonomy_greengenes)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 5 taxa and 6 samples ]
## sample_data() Sample Data:       [ 6 samples by 4 sample variables ]
## tax_table()   Taxonomy Table:    [ 5 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 5 tips and 4 internal nodes ]
## refseq()      DNAStringSet:      [ 5 reference sequences ]

import_biom(rich_sparse_biom, treefilename, refseqfilename, parseFunction=parse_taxonomy_greengenes)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 5 taxa and 6 samples ]
## sample_data() Sample Data:       [ 6 samples by 4 sample variables ]
## tax_table()   Taxonomy Table:    [ 5 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 5 tips and 4 internal nodes ]
## refseq()      DNAStringSet:      [ 5 reference sequences ]

import_biom(min_dense_biom, treefilename, refseqfilename, parseFunction=parse_taxonomy_greengenes)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 5 taxa and 6 samples ]
## phy_tree()    Phylogenetic Tree: [ 5 tips and 4 internal nodes ]
## refseq()      DNAStringSet:      [ 5 reference sequences ]

import_biom(min_sparse_biom, treefilename, refseqfilename, parseFunction=parse_taxonomy_greengenes)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 5 taxa and 6 samples ]
## phy_tree()    Phylogenetic Tree: [ 5 tips and 4 internal nodes ]
## refseq()      DNAStringSet:      [ 5 reference sequences ]

Example code for importing large file with parallel backend

library("doParallel")
registerDoParallel(cores=6)
import_biom("my/file/path/file.biom", parseFunction=parse_taxonomy_greengenes, parallel=TRUE)

In practice, you will store the result of your import as some variable name, like myData, and then use this data object in downstream data manipulations and analysis. For example,

myData = import_biom(rich_dense_biom, treefilename, refseqfilename, parseFunction=parse_taxonomy_greengenes)
plot_tree(myData, color="Genus", shape="BODY_SITE", size="abundance")

plot of chunk unnamed-chunk-4

plot_richness(myData, x="BODY_SITE", color="Description")

plot of chunk unnamed-chunk-5

plot_bar(myData, fill="Genus")

plot of chunk unnamed-chunk-6

refseq(myData)

##   A DNAStringSet instance of length 5
##     width seq                                          names               
## [1]   334 AACGTAGGTCACAAGCGTTGT...TTCCGTGCCGGAGTTAACAC GG_OTU_1
## [2]   465 TACGTAGGGAGCAAGCGTTAT...CCTTACCAGGGCTTGACATA GG_OTU_2
## [3]   249 TACGTAGGGGGCAAGCGTTAT...GGCTCGAAAGCGTGGGGAGC GG_OTU_3
## [4]   453 TACGTATGGTGCAAGCGTTAT...AAGCAACGCGAAGAACCTTA GG_OTU_4
## [5]   178 AACGTAGGGTGCAAGCGTTGT...GGAATGCGTAGATATCGGGA GG_OTU_5

Functions for Accessing and (Pre)Processing Data

Load the GlobalPatterns dataset, included with the phyloseq package.

data(GlobalPatterns)

Accessors

Components of a phyloseq object, like the OTU Table, can be accessed by special accessor functions, or “accessors'', which return specific information about phylogenetic sequencing data, if present. These accessor functions are available for direct interaction by users and dependent functions/packages.

GlobalPatterns

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 19216 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 19216 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]

ntaxa(GlobalPatterns)

## [1] 19216

nsamples(GlobalPatterns)

## [1] 26

sample_names(GlobalPatterns)[1:5]

## [1] "CL3"     "CC1"     "SV1"     "M31Fcsw" "M11Fcsw"

rank_names(GlobalPatterns)

## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"

sample_variables(GlobalPatterns)

## [1] "X.SampleID"               "Primer"                  
## [3] "Final_Barcode"            "Barcode_truncated_plus_T"
## [5] "Barcode_full_length"      "SampleType"              
## [7] "Description"

otu_table(GlobalPatterns)[1:5, 1:5]

## OTU Table:          [5 taxa and 5 samples]
##                      taxa are rows
##        CL3 CC1 SV1 M31Fcsw M11Fcsw
## 549322   0   0   0       0       0
## 522457   0   0   0       0       0
## 951      0   0   0       0       0
## 244423   0   0   0       0       0
## 586076   0   0   0       0       0

tax_table(GlobalPatterns)[1:5, 1:4]

## Taxonomy Table:     [5 taxa by 4 taxonomic ranks]:
##        Kingdom   Phylum          Class          Order         
## 549322 "Archaea" "Crenarchaeota" "Thermoprotei" NA            
## 522457 "Archaea" "Crenarchaeota" "Thermoprotei" NA            
## 951    "Archaea" "Crenarchaeota" "Thermoprotei" "Sulfolobales"
## 244423 "Archaea" "Crenarchaeota" "Sd-NA"        NA            
## 586076 "Archaea" "Crenarchaeota" "Sd-NA"        NA

phy_tree(GlobalPatterns)

## 
## Phylogenetic tree with 19216 tips and 19215 internal nodes.
## 
## Tip labels:
##  549322, 522457, 951, 244423, 586076, 246140, ...
## Node labels:
##  , 0.858.4, 1.000.154, 0.764.3, 0.995.2, 1.000.2, ...
## 
## Rooted; includes branch lengths.

taxa_names(GlobalPatterns)[1:10]

##  [1] "549322" "522457" "951"    "244423" "586076" "246140" "143239"
##  [8] "244960" "255340" "144887"

myTaxa = taxa_names(GlobalPatterns)[1:10]
plot(phy_tree(prune_taxa(myTaxa, GlobalPatterns)))

plot of chunk gp-sample-names

Preprocessing

The phyloseqBase package also includes functions for filtering, subsetting, and merging abundance data. Filtering in phyloseq is designed in a modular fashion similar to the approach in the genefilter package. This includes the prune_taxa and prune_samples methods for directly removing unwanted indices, as well as the filterfun_sample and genefilter_sample functions for building arbitrarily complex sample-wise filtering criteria, and the filter_taxa function for taxa-wise filtering. In the following example, the GlobalPatterns data is first transformed to relative abundance, creating the new GPr object, which is then filtered such that all OTUs with a variance greater than 10^-5 are removed. This results in a highly-subsetted object containing just 177 of the original ~19000 OTUs (GPfr below). Note that in both lines we have provided a custom function for transformation and filtering, respectively.

GPr  = transform_sample_counts(GlobalPatterns, function(x) x / sum(x) )
GPfr = filter_taxa(GPr, function(x) var(x) > 1e-5, TRUE)

The subsetting methods prune_taxa and prune_samples are for cases where the complete subset of desired OTUs or samples is directly available. Alternatively, the subset_taxa and subset_samples functions are for subsetting based on auxiliary data contained in the Taxonomy Table or Sample Data components, respectively. These functions are analogous to the subset function in core R, in which the initial data argument is followed by an arbitrary logical expression that indicates elements or rows to keep. Thus, entire experiment-level data objects can be subset according to conditional expressions regarding the auxiliary data. For example, the following code will first assign to GP.chl the subset of the GlobalPatterns dataset that are part of the Chlamydiae phylum, and then remove samples with less than 20 total reads.

GP.chl = subset_taxa(GlobalPatterns, Phylum=="Chlamydiae")
GP.chl = prune_samples(sampleSums(GP.chl)>=20, GP.chl)

Merging methods include merge_taxa and merge_samples, intended for merging specific OTUs or samples, respectively. There is also the merge_phyloseq function for a complete merge of two or more phyloseq-objects (or a phyloseq-object and one or more separate components). For example, the following code merges the first 5 OTUs in the Chlamydiae-only dataset.

GP.chl.merged = merge_taxa(GP.chl, taxa_names(GP.chl)[1:5])

Building on the merge_taxa methods, the phyloseq-package includes the agglomeration functions, tip_glom and tax_glom, for merging all OTUs in an experiment that are similar beyond a phylogenetic or taxonomic threshold, respectively. The following code demonstrates how to agglomerate the "Bacteroidetes-only” dataset (called gpsfb) at the taxonomic rank of Family, and create an annotated tree of the result.

gpsfb = subset_taxa(GPfr, Phylum == "Bacteroidetes")
gpsfbg = tax_glom(gpsfb, "Family")
plot_tree(gpsfbg, color="SampleType", shape="Class", size="abundance")

plot of chunk tip-glom-tree-supp-intext

For transforming abundance values by an arbitrary R function, phyloseqBase includes the transform_sample_counts function. It takes as arguments a phyloseq-object and an R function, and returns a phyloseq-object in which the abundance values have been transformed, sample-wise, according to the transformations specified by the function. For example, the following command transforms GP.chl abundance counts to fractional abundance.

transform_sample_counts(GP.chl, function(OTU) OTU/sum(OTU) )

Finally, the following is the remaining set of preprocessing steps that was applied to the GlobalPatterns OTU counts prior to creating the figures in the main phyloseq manuscript.

Remove taxa not seen more than 3 times in at least 20% of the samples. This protects against an OTU with small mean & trivially large C.V.

GP = filter_taxa(GlobalPatterns, function(x) sum(x > 3) > (0.2*length(x)), TRUE)

Define a human versus non-human categorical variable, and add this new variable to sample data:

sample_data(GP)$human = factor( get_variable(GP, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue") )

Standardize abundances to the median sequencing depth

total = median(sample_sums(GP))
standf = function(x, t=total) round(t * (x / sum(x)))
gps = transform_sample_counts(GP, standf)

Filter the taxa using a cutoff of 3.0 for the Coefficient of Variation

gpsf = filter_taxa(gps, function(x) sd(x)/mean(x) > 3.0, TRUE)

Subset the data to Bacteroidetes, used in some plots

gpsfb = subset_taxa(gpsf, Phylum=="Bacteroidetes")

graphic summary

Now let's summarize this slice of the data with some graphics.

title = "plot_bar; Bacteroidetes-only"
plot_bar(gpsfb, "SampleType", "Abundance", title=title)

plot of chunk unnamed-chunk-12

plot_bar(gpsfb, "SampleType", "Abundance", "Family", title=title)

plot of chunk unnamed-chunk-12

plot_bar(gpsfb, "Family", "Abundance", "Family", 
         title=title, facet_grid="SampleType~.")

## Warning: Removed 336 rows containing missing values (position_stack).
## Warning: Removed 168 rows containing missing values (position_stack).
## Warning: Removed 252 rows containing missing values (position_stack).
## Warning: Removed 252 rows containing missing values (position_stack).
## Warning: Removed 252 rows containing missing values (position_stack).
## Warning: Removed 252 rows containing missing values (position_stack).
## Warning: Removed 252 rows containing missing values (position_stack).
## Warning: Removed 252 rows containing missing values (position_stack).
## Warning: Removed 168 rows containing missing values (position_stack).

plot of chunk unnamed-chunk-13

The distance function in phyloseq

The distance function takes a phyloseq-class object and method option, and returns a dist-class distance object suitable for certain ordination methods and other distance-based analyses. There are currently 44 explicitly supported method options in the phyloseq package, as well as user-provided arbitrary methods via an interface to vegan::designdist. For the complete list of currently supported options/arguments to the method parameter, type distance("list") in the command-line of your R session. Only sample-wise distances are currently supported (the type argument), but eventually OTU-wise (e.g. species) distances will be supported as well.

See the in-package documentation of distance for further details:

?distance

Example: “Enterotypes” dataset using many different methods

Because the distance() function organizes distance calculations into one function, it is relatively straightforward to calculate all supported distance methods and investigate the results. The following code will perform such a loop on the “Enterotypes” dataset, perform multi-dimensional scaling (a.k.a. principle coordinates analysis), and plot the first two axes, shading and shaping the points in each plot according to sequencing technology and assigned “Enterotype” label.

Note that we have omitted the options that require a phylogenetic tree because the "enterotype" example dataset currently included in the phyloseq-package does not have one.

Note that this may take a little while to run, depending on the size of your data set, but you may not be interested in all supported distances…

Load the enterotype data

data(enterotype)

Remove the OTUs that included all unassigned sequences ("-1")

enterotype <- subset_species(enterotype, Genus != "-1")

The available distance methods coded in distance

dist_methods <- unlist(distance("list"))
print(dist_methods)

##      UniFrac        DPCoA          JSD     vegdist1     vegdist2 
##    "unifrac"      "dpcoa"        "jsd"  "manhattan"  "euclidean" 
##     vegdist3     vegdist4     vegdist5     vegdist6     vegdist7 
##   "canberra"       "bray" "kulczynski"    "jaccard"      "gower" 
##     vegdist8     vegdist9    vegdist10    vegdist11    vegdist12 
##   "altGower"   "morisita"       "horn"  "mountford"       "raup" 
##    vegdist13    vegdist14    vegdist15   betadiver1   betadiver2 
##   "binomial"       "chao"        "cao"          "w"         "-1" 
##   betadiver3   betadiver4   betadiver5   betadiver6   betadiver7 
##          "c"         "wb"          "r"          "I"          "e" 
##   betadiver8   betadiver9  betadiver10  betadiver11  betadiver12 
##          "t"         "me"          "j"        "sor"          "m" 
##  betadiver13  betadiver14  betadiver15  betadiver16  betadiver17 
##         "-2"         "co"         "cc"          "g"         "-3" 
##  betadiver18  betadiver19  betadiver20  betadiver21  betadiver22 
##          "l"         "19"         "hk"        "rlb"        "sim" 
##  betadiver23  betadiver24        dist1        dist2        dist3 
##         "gl"          "z"    "maximum"     "binary"  "minkowski" 
##   designdist 
##        "ANY"

Remove the two distance-methods that require a tree, and the generic custom method that requires user-defined distance arguments.

# These require tree
dist_methods[(1:2)]

##   UniFrac     DPCoA 
## "unifrac"   "dpcoa"

# Remove them from the vector
dist_methods <- dist_methods[-(1:2)]
# This is the user-defined method:
dist_methods["designdist"]

## designdist 
##      "ANY"

# Remove the user-defined distance
dist_methods = dist_methods[-which(dist_methods=="ANY")]

Loop through each distance method, save each plot to a list, called plist.

plist <- vector("list", length(dist_methods))
names(plist) = dist_methods
for( i in dist_methods ){
  # Calculate distance matrix
    iDist <- distance(enterotype, method=i)
    # Calculate ordination
    iMDS  <- ordinate(enterotype, "MDS", distance=iDist)
    ## Make plot
    # Don't carry over previous plot (if error, p will be blank)
    p <- NULL
    # Create plot, store as temp variable, p
    p <- plot_ordination(enterotype, iMDS, color="SeqTech", shape="Enterotype")
    # Add title to each plot
    p <- p + ggtitle(paste("MDS using distance method ", i, sep=""))
    # Save the graphic to file.
    plist[[i]] = p
}

Combine all results

Shade according to sequencing technology

require(plyr)
df = ldply(plist, function(x) x$data)
names(df)[1] <- "distance"
p = ggplot(df, aes(Axis.1, Axis.2, color=SeqTech, shape=Enterotype))
p = p + geom_point(size=3, alpha=0.5)
p = p + facet_wrap(~distance, scales="free")
p = p + ggtitle("MDS on various distance metrics for Enterotype dataset")
p

plot of chunk distance-summary-plot

Selected Results

The following are some selected examples among the created plots.

Jensen-Shannon Divergence

print(plist[["jsd"]])

plot of chunk unnamed-chunk-20

Jaccard

print(plist[["jaccard"]])

plot of chunk unnamed-chunk-21

Bray-Curtis

print(plist[["bray"]])

plot of chunk unnamed-chunk-22

Gower

print(plist[["gower"]])

plot of chunk unnamed-chunk-23

print(plist[["w"]])

plot of chunk unnamed-chunk-24

Clustering and the gap statistic

One of the problems in clustering is to figure out how many clusters there are. One way to do this is with the gap statistic. We'll illustrate with the enterotype data.

require(cluster)
exord = ordinate(enterotype, method="MDS", distance="jsd")

Compute the gap statistic with

pam1 = function(x, k){list(cluster = pam(x,k, cluster.only=TRUE))}
x = phyloseq:::scores.pcoa(exord, display="sites")
# gskmn = clusGap(x[, 1:2], FUN=kmeans, nstart=20, K.max = 6, B = 500)
gskmn = clusGap(x[, 1:2], FUN=pam1, K.max = 6, B = 50)

## Clustering k = 1,2,..., K.max (= 6): .. done
## Bootstrapping, b = 1,2,..., B (= 50)  [one "." per sample]:
## .................................................. 50

gskmn

## Clustering Gap statistic ["clusGap"].
## B=50 simulated reference sets, k = 1..6
##  --> Number of clusters (method 'firstSEmax', SE.factor=1): 4
##       logW E.logW    gap  SE.sim
## [1,] 2.712  3.031 0.3197 0.02008
## [2,] 2.328  2.755 0.4276 0.02219
## [3,] 2.063  2.520 0.4568 0.02178
## [4,] 1.823  2.319 0.4959 0.02225
## [5,] 1.732  2.210 0.4772 0.01937
## [6,] 1.566  2.115 0.5493 0.01864

Ordination methods

Ordination methods can be a useful tool for exploring complex phylogenetic sequencing data, particularly when the hypothesized structure of the data is poorly defined (or there isn't a hypothesis). The phyloseq package provides some useful tools for performing ordinations and plotting their results, via the ordinate() and plot ordination() functions, respectively. Although there are many options and methods supported, a first-step will probably look something like the following:

my.physeq <- import("Biom", BIOMfilename = "myBiomFile.biom")
my.ord <- ordinate(my.physeq)
plot_ordination(my.physeq, my.ord, color = "myFavoriteVarible")

We will show continue our exploration of the “GlobalPatterns” dataset using various features of an ordination method called Correspondence Analysis. Let's limit the number of species:

data(GlobalPatterns)
# Take a subset of the GP dataset, top 200 species
topsp <- names(sort(taxa_sums(GlobalPatterns), TRUE)[1:200])
GP <- prune_species(topsp, GlobalPatterns)
# Subset further to top 5 phyla, among the top 200 OTUs.
top5ph <- sort(tapply(taxa_sums(GP), tax_table(GP)[, "Phylum"], sum), decreasing = TRUE)[1:5]
GP <- subset_species(GP, Phylum %in% names(top5ph))
# Re-add human variable to sample data:
sample_data(GP)$human <- factor( get_variable(GP, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue") )

First see how the samples behave on the first few CA axes:

gpca <- ordinate(GP, "CCA")
barplot(gpca$CA$eig/sum(gpca$CA$eig), las = 2)

plot of chunk unnamed-chunk-29

Now the correspondence analysis on axes (1,2) and (3,4)

(p12 <- plot_ordination(GP, gpca, "samples", color = "SampleType") + geom_line() + geom_point(size = 5))

plot of chunk unnamed-chunk-30

(p34 <- plot_ordination(GP, gpca, "samples", axes = c(3, 4), color = "SampleType") + geom_line() + geom_point(size = 5))

plot of chunk unnamed-chunk-30

You can now do the homework associated with this lab, which is at https://web.stanford.edu/class/bios221/cgi-bin/index.cgi/.