Welcome to the
Wong Lab.
A) OVERVIEW
Our group develops methods and software for the analysis
of data from high-throughput genomics projects.
Particular interests include the analysis of gene
expression profiles and cis-regulatory sequences. We are
working closely with collaborating laboratories in
investigations of cancer and developmental biology. We
develop and enhance tools in exploratory data analysis,
multivariate analysis, information theory, machine
learning, Monte Carlo, graph theory, linear and nonlinear
differential equations, and apply them to problems in
computational and systems biology.
B) METHODOLOGICAL RESEARCH
1) Microarray analysis
The current generation of oligonucleotide expression
arrays for human or mouse contain enough probe sets to
measure the expression levels of almost all mammalian
genes. Since the probe sequences are now available from
NetAffx, we have the opportunity to identify, for every
probe on the array, all transcripts that have the
potential to hybridize to the probe. With this
information, we are now developing de-convolution
algorithms aiming to correct for cross-hybridizing signal
contributed by non-target transcripts, thereby allowing
more accurate estimates for the levels of
low-concentration transcripts.
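The de-convolution idea described above can be illustrated with a
minimal sketch. It assumes a simple linear model in which each probe
intensity is a weighted sum of the concentrations of all transcripts
that can hybridize to it. The affinity values, the function name
`deconvolve`, and the projected-gradient solver are illustrative
assumptions, not the lab's actual algorithm.

```python
def deconvolve(affinity, intensity, steps=2000, lr=0.1):
    """Estimate non-negative transcript levels c from y ~ A c.

    Uses projected gradient descent on 0.5 * ||A c - y||^2,
    clipping negative concentrations to zero after each step.
    """
    n_probes, n_tx = len(affinity), len(affinity[0])
    conc = [1.0] * n_tx
    for _ in range(steps):
        # Residual per probe: predicted intensity minus observed intensity.
        resid = [sum(a * c for a, c in zip(row, conc)) - y
                 for row, y in zip(affinity, intensity)]
        # Gradient with respect to each transcript concentration.
        grad = [sum(affinity[p][t] * resid[p] for p in range(n_probes))
                for t in range(n_tx)]
        conc = [max(0.0, c - lr * g) for c, g in zip(conc, grad)]
    return conc


# Hypothetical affinities: rows are probes, columns are transcripts.
# Off-target entries model cross-hybridization.
affinity = [[1.0, 0.2],
            [0.1, 1.0],
            [0.9, 0.3]]
true_conc = [0.5, 10.0]  # transcript 0 is rare, transcript 1 is abundant
intensity = [sum(a * c for a, c in zip(row, true_conc)) for row in affinity]

# Naive estimate for transcript 0 ignores cross-hybridization on probe 0.
naive = intensity[0] / affinity[0][0]
estimate = deconvolve(affinity, intensity)
print(naive, estimate)
```

In this toy setting the naive reading of probe 0 overstates the rare
transcript several-fold because of signal from the abundant one, while
the joint fit recovers both levels.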
We are also developing methods for handling non-expression
arrays. For example, we work with the Murray lab to
analyze tag-probe arrays from yeast competition
experiments. We have an ongoing collaboration with
Affymetrix to develop methods for analyzing data from
their 100 SNP arrays, both in the context of linkage and
association studies and for the detection of chromosomal
copy number aberrations in cancer cells.
2) Cis-regulatory analysis and comparative genomics
We are interested in understanding the co-expression
patterns observed from microarray gene expression
experiments. To this end we develop methods for the
identification of regulatory sequence elements in the
upstream or intronic regions of genes showing
co-expression. We build probabilistic motifs for binding
sites of transcription factors and design statistical
methods to assess the significance of recurring motifs and
combinatorial patterns of such motifs. The effective use
of information from multiple genomes to support these
analyses is a major research goal. We are working with several
collaborators who are generating expression profiles and
transcription factor binding location data to study
mammalian gene regulation.
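As a rough illustration of what a probabilistic motif looks like in
practice, the sketch below builds a position weight matrix from a few
aligned binding sites and scans a sequence for the best-scoring window.
The example sites, the uniform background, the pseudocount, and the
function names are assumptions made for illustration only.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0):
    """Column-wise base probabilities from equal-length aligned sites."""
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm


def log_odds(pwm, window, background=0.25):
    """Log2 likelihood ratio of motif model versus uniform background."""
    return sum(math.log2(col[base] / background)
               for col, base in zip(pwm, window))


def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window."""
    width = len(pwm)
    return max((log_odds(pwm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


# Hypothetical aligned sites and a hypothetical upstream sequence.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
pwm = build_pwm(sites)
score, start = best_hit(pwm, "GGCGTATAATCCG")
print(score, start)
```

Assessing significance would additionally require a null distribution
for such scores, which this sketch does not attempt.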
3) Statistical learning and computation
In addition to employing existing statistical and
computing methods in biological applications, we continue
to develop new methods in these core methodological areas.
For example, we have developed a nonlinear cost function
for support-vector machine (SVM) type classifiers. Our
method, called psi-learning, exploits recent advances in
Difference-of-Convex optimization and has a provably
better asymptotic rate than the standard SVM. In the
unsupervised learning area, we have recently introduced
Tight-Clustering, an algorithm that overcomes the
combinatorial search complexity to deliver very tight and
stable clusters of sizes 5-50. Our experience, from gene
expression studies, suggests that these tight clusters are
more suitable for biological interpretation than standard
hierarchical and K-means clustering.
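To make the contrast between the SVM cost and a psi-type cost
concrete, the sketch below compares the hinge loss with a bounded,
truncated loss written as a difference of two convex functions. The
specific form of `psi` is an assumption chosen for illustration (this
page does not state the exact function), and `u` stands for the
functional margin y * f(x).

```python
def hinge(u):
    """Standard SVM hinge loss; grows without bound as u decreases."""
    return max(0.0, 1.0 - u)


def psi1(u):
    """Convex part: a scaled hinge."""
    return 2.0 * max(0.0, 1.0 - u)


def psi2(u):
    """Convex part subtracted to cap the loss for u < 0."""
    return 2.0 * max(0.0, -u)


def psi(u):
    """Assumed psi-type loss as a difference of convex functions."""
    return psi1(u) - psi2(u)


for u in (-5.0, -1.0, 0.5, 2.0):
    print(u, hinge(u), psi(u))
```

Because the assumed `psi` is bounded, a badly misclassified point
contributes no more than any other misclassified point, whereas its
hinge loss grows linearly. The difference-of-convex form is what
allows optimization by a sequence of convex subproblems.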
One of our long-term objectives is to incorporate more
biological knowledge in the computational analysis of
genomics data. Despite heroic efforts by genome databases,
most of the knowledge is still embedded in text format in
the primary literature. As such, this knowledge is
difficult to use in computational analysis. We are
beginning to investigate approaches that allow more
efficient use of this knowledge. For example, we have
designed the knowledge-management software GeneNotes. This
program creates a database to keep track of relations,
sentences and paragraphs highlighted during online
browsing of abstracts or primary literature. Using natural
language processing techniques, the program can draw the
researcher’s attention to key phrases and sentences, and
can automatically detect binary relations. It can operate
in a learning mode to improve its prediction
accuracy. The software also provides tools for visualizing
and interpreting the information captured by the database.
This software can interact with our expression analysis
software to enable novel approaches to study the
biological background of co-expressed gene clusters.
Our group has a longstanding interest in Monte Carlo
simulation and has contributed to the development of
several useful algorithms in this area: data augmentation,
sequential imputation, dynamic weighting and evolutionary
Monte Carlo. Recently we introduced a dual-space approach
that uses the energy-temperature duality to design a
sampler capable of overcoming steep energy barriers while
simultaneously providing estimates of Boltzmann averages
and the density of states. We are applying this
“equi-energy sampling” approach to the study of protein
folding using simplified energy functions.
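The link between the density of states and Boltzmann averages can be
shown on a toy system. The sketch below is not the equi-energy
sampler itself: it enumerates all states of a small, hypothetical
Ising ring exactly, tabulates the density of states, and checks that
the mean energy at any temperature can be computed from that table
alone.

```python
import itertools
import math
from collections import Counter


def energy(spins):
    """Nearest-neighbour Ising energy on a ring (coupling set to 1)."""
    n = len(spins)
    return -sum(spins[i] * spins[(i + 1) % n] for i in range(n))


def density_of_states(n):
    """Count how many of the 2**n spin states have each energy."""
    return Counter(energy(s) for s in itertools.product((-1, 1), repeat=n))


def mean_energy_from_dos(dos, temperature):
    """Boltzmann average of the energy using only the density of states."""
    weights = {e: g * math.exp(-e / temperature) for e, g in dos.items()}
    z = sum(weights.values())
    return sum(e * w for e, w in weights.items()) / z


def mean_energy_direct(n, temperature):
    """Same average computed by summing over every state."""
    num = z = 0.0
    for s in itertools.product((-1, 1), repeat=n):
        e = energy(s)
        w = math.exp(-e / temperature)
        num += e * w
        z += w
    return num / z


n_spins = 8
dos = density_of_states(n_spins)
for t in (0.5, 1.0, 5.0):
    print(t, mean_energy_from_dos(dos, t), mean_energy_direct(n_spins, t))
```

For realistic systems exhaustive enumeration is impossible, which is
why a sampler that estimates the density of states is useful.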
C) BIOLOGICAL INTEREST
The major areas in biology of interest to our laboratory
include developmental genomics and signal transduction
networks. We are as interested in advancing biological
understanding as in developing computational analysis
methods. We have initiated an effort to study the
transcriptional program during early embryonic
development. Our approach uses the in vitro development of
mouse embryonic stem cells and microarray profiling of
FACS-purified cells. Furthermore, we have initiated
investigations into Hox gene regulation. Our interest in
finding regulatory elements through comparative genomics
meshes well with these projects. In the area of signal
transduction, we are working with Perrimon on the use of RNA
interference to investigate the network of interactions
among kinases and phosphatases in Drosophila. We are also
investigating the use of novel experimental strategies,
such as periodic signal input, for the study of genetic
networks.