Shirley Wu
:: research ::


Current projects

The Structural Genomics initiative currently spans three continents and 20 institutions working toward high-throughput structure determination. One goal is to add to the diversity of known structures by solving structures with folds not found in the PDB. Structural genomics centers have contributed thousands of new structures to the PDB since the initiative's inception in 2000. Because many of these structures contain new, uncharacterized folds and have no known function, it is important not only to create accurate tools for predicting protein function from structure but also to develop ways to discover potentially novel biological sites in these structures. Liping Wei and others in the Altman lab previously developed FEATURE, a system for modeling and predicting functional sites in protein structures, and Mike Liang recently extended their work to automate the process of building models. My work has two parts. The first applies this automated process to build a library of 3D functional site models, scan the entire PDB for functional sites, and make the tools and data available online. The second develops an approach for identifying compelling clusters of protein structure sites and characterizing them, with the goal of discovering novel annotations, motifs, and functions.

SeqFEATURE
The idea behind SeqFEATURE is to develop a pipeline for generating structural models of functional sites quickly and automatically. To do this, we take 1D functional motifs (such as PROSITE patterns) and extract examples of them from the PDB for use as a structural training set. We've used this principle to build a library of over 100 models, which can be accessed via our online tool, WebFEATURE.
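
As a rough illustration of the first step (locating occurrences of a 1D motif in a protein sequence), the sketch below translates a PROSITE-style pattern into a regular expression and reports where it matches. This is not the SeqFEATURE pipeline's code: the helper names prosite_to_regex and find_motif_sites are hypothetical, only the common subset of the PROSITE pattern syntax is handled, and mapping the matched residues onto PDB coordinates to build a structural training set is left out.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string.

    Handles x (any residue), [..] (allowed set), {..} (forbidden set),
    repeat counts such as x(4) or x(2,4), and the < and > terminal anchors
    when they appear at the start or end of an element.
    """
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count, e.g. "x(2,4)" -> "x", "2", "4".
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        # Bracketed sets and single residue letters are already valid regex.
        repeat = ""
        if lo:
            repeat = "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motif_sites(sequence, pattern):
    """Return (0-based start, matched residues) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


# Example: the P-loop (ATP/GTP-binding site motif A) pattern on a toy sequence.
hits = find_motif_sites("MKVGAPGSGKSTLA", "[AG]-x(4)-G-K-[ST]")
```

In a pipeline of the kind described above, each hit would then be looked up in the corresponding PDB structure so that the residues at the matched positions can serve as positive training examples.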

For more information about SeqFEATURE, please refer to our recent publication:

Wu S, Liang MP, Altman RB. (2008) The SeqFEATURE library of 3D functional site models: comparison to existing methods and applications to protein function annotation. Genome Biology. 9:R8. [full text]

For more information about FEATURE, please refer to our recent review:

Halperin I, Glazer DS, Wu S, Altman RB. (2008) The FEATURE framework for protein function annotation: modeling new functions, improving performance, and extending to novel applications. BMC Genomics. 9(Suppl 2):S2. [full text]

Clustering for discovery of protein functions
The lab has recently published preliminary work on clustering FEATURE vectors from a non-redundant subset of the PDB. Using binary features and k-means clustering with a weighted Hamming distance, we were able to rediscover PROSITE motifs within our clustering. Currently, we are looking into more efficient and more effective clustering methods, including hierarchical clustering within the clusters obtained from k-means, as a way of refining the original clustering for use in building additional functional site models and for annotation. Measures of coherence, both internal and external, will help us select for analysis the clusters most likely to represent biologically relevant sites. We can then use information from literature and databases to characterize these clusters. Any clusters representing novel motifs or functions that we discover can then become input for new function prediction models.
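
The general idea of clustering binary vectors under a weighted Hamming distance can be sketched as follows. This is a minimal illustration, not the published method: the function names, the farthest-first initialization, and the majority-vote center update (in the style of k-modes) are all assumptions made here, and the toy vectors and weights are invented for the example.

```python
def weighted_hamming(a, b, weights):
    """Sum the weights of the positions where two binary vectors differ."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def majority_center(members, n_features):
    """Per-feature majority vote; ties resolve to 0."""
    center = []
    for j in range(n_features):
        ones = sum(v[j] for v in members)
        center.append(1 if 2 * ones > len(members) else 0)
    return center


def cluster_binary(vectors, k, weights, n_iter=20):
    """k-means-style clustering of binary vectors (assumed k-modes update)."""
    # Farthest-first initialization (an assumption, chosen for determinism).
    centers = [list(vectors[0])]
    while len(centers) < k:
        farthest = max(
            vectors,
            key=lambda v: min(weighted_hamming(v, c, weights) for c in centers),
        )
        centers.append(list(farthest))

    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center under the weighted Hamming distance.
        new_labels = [
            min(range(k), key=lambda c: weighted_hamming(v, centers[c], weights))
            for v in vectors
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: majority vote per feature within each cluster.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = majority_center(members, len(weights))
    return labels, centers


# Toy data: six binary vectors forming two loose groups.
vectors = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
weights = [2.0, 2.0, 1.0, 1.0, 2.0, 2.0]
labels, centers = cluster_binary(vectors, k=2, weights=weights)
```

For positive weights, the per-feature majority vote minimizes the summed weighted Hamming distance to the cluster members, which is why it stands in for the mean here. A refinement step of the kind mentioned above could rerun a hierarchical method on the members of each resulting cluster.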

Past projects

Automatic phosphorylation site recognition
Protein phosphorylation is essential for regulating and effecting many important biological processes. Phosphorylation in eukaryotes, in which a phosphate is added to serine, threonine, or tyrosine residues, is crucial for cell signaling and is an especially widespread phenomenon: over one-third of the human proteome is estimated to be phosphorylatable. Because of this, the study of protein kinases, phosphorylated proteins, and biological pathways regulated by phosphorylation is particularly significant for drug design. Yet only a few thousand phosphorylation sites are currently known. It is difficult to identify these sites experimentally, and accurate phosphorylation site prediction algorithms would expedite the process and therefore contribute to our understanding of phosphorylation and phosphorylation-mediated pathways in the cell.

A number of methods exist for predicting phosphorylation sites, including the popular ScanSite and NetPhos and the less well-known KinasePhos, DISPHOS, and GPS. However, the level of specificity offered by all of these methods is not high enough to justify their large-scale application. Part of the problem may be that almost all of them treat only local sequence data as important for discriminating phosphorylation sites. Structure has already been shown to play a role in kinase-substrate recognition, and interactions distal to the phosphorylation site, such as substrate recruitment sites, are also important for defining good substrates in vivo. There is thus an exciting opportunity to study the local and global structural features that characterize phosphorylation sites and incorporate them into an improved prediction algorithm.

In this work, I studied the local structural environment around phosphorylation sites, and investigated the potential usefulness of substrate recruitment sites as well. The goal was to identify conserved structural features that we could then incorporate into a model for predicting phosphorylation sites with better performance than existing methods.

Biological pathways and visualization
Biological pathways describe the various molecules and interactions involved in biological processes. As such, they are an important way that we as scientists synthesize and represent our knowledge of these processes. Pathways can be arbitrarily complex in biology, yet they are mentally intuitive. In the past few years, pathway data has become increasingly prevalent in databases and literature, and it is extremely important that this data be displayable, exchangeable, and malleable in a high-throughput manner. Several exchange formats for biological pathway data have been proposed, with the most recent, BioPAX, emerging as the logical choice to be the community standard.

In addition to data exchange, there is a need to visualize biological pathway data appropriately in a dynamic and computational way. The paper-based and manually created diagrams we are used to simply do not scale to the rate at which pathway data is now gathered and changed. Yet pathway diagrams should still retain the clarity, descriptiveness, and attractiveness of traditional hand-made drawings. Although a number of tools for analyzing and visualizing pathways exist (GenMAPP, VisAnt, Cytoscape), none of them offered complete BioPAX compatibility at the time, and many were visually unappealing. As part of a biomedical informatics project course, two others and I created a software application for visualizing BioPAX-encoded pathway data, the Biological Pathway and Visualization Editor (BioPAVE). BioPAVE consisted of a graphical interface in which new pathways could be created and existing pathways edited using drag-and-drop tools and attribute windows. Data could be imported and exported as BioPAX code, with optional layout coordinates stored in a separate file.

My lab: The Helix Group.
Relevant projects: FEATURE, WebFEATURE
Academics: Stanford Biomedical Informatics