
Studying Patterns of Genes That Affect Cancer Development

Pei Wang
Statistics
Stanford University
June 2003

I'm developing a statistical algorithm to study genetic data from cancers. The goal is to find out which genes are involved in cancer and how they affect its development. This will help with the early diagnosis and treatment of cancerous tumors.

DNA is a large molecule in the cell that carries genetic information. A section of a DNA sequence is called a gene if it controls some activity of the cell by encoding proteins at specific times. In other words, if we imagine a cell as a tiny factory, then the genes are the top "command group" of this factory. They decide what "goods" should be produced in the factory, and when. If some "commanders" don't carry out their responsibilities, the factory will break down. For example, some cells may lose the ability to keep to the normal pace of reproducing and dying out. When these sick cells keep reproducing at a high speed, tumors form. In some cases these tumors are malignant, and such tumors are called cancerous.

We say a "gene event" happens when a particular gene cannot function normally. Scientists have found that cancer is usually caused not by a single gene event, but by several gene events together. They also believe that certain patterns of gene events are responsible for the development of a particular cancer.

New experimental techniques now let us measure the "performance" of every gene in a given cell, from which we can assess whether a gene event has happened to that gene. Based on this data, my work is to use statistical methods to find the patterns of gene events that are responsible for cancer development.

To do this, I first set up a statistical model that describes the process of cancer development, which turns the problem into a statistical one. Then I develop a statistical algorithm to find the underlying rule. The algorithm is an iterative learning process. Simply put, it combines the "strength" of many weak "learners" that are fitted to the data, reaching an accurate result in the end.
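The text does not name the algorithm, so the following is only a minimal sketch of one well-known way to combine weak learners, an AdaBoost-style procedure, and not necessarily the method used in this work. It assumes each sample is a list of 0/1 gene-event indicators and each label is +1 (tumor) or -1 (normal). The function names, the toy data, and the choice of single-gene rules as weak learners are all illustrative assumptions.

```python
import math

def stump_predict(stump, row):
    """A weak learner: vote based on one gene, or a constant vote if gene is None."""
    gene, polarity = stump
    if gene is None:
        return polarity
    return polarity if row[gene] == 1 else -polarity

def fit_stump(X, y, w):
    """Return the weak learner with the lowest weighted error, and that error."""
    genes = [None] + list(range(len(X[0])))
    best, best_err = None, float("inf")
    for stump in [(g, p) for g in genes for p in (1, -1)]:
        err = sum(wi for wi, row, yi in zip(w, X, y)
                  if stump_predict(stump, row) != yi)
        if err < best_err:
            best, best_err = stump, err
    return best, best_err

def boost(X, y, rounds=3):
    """Fit weak learners one at a time, re-weighting misclassified samples."""
    w = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(rounds):
        stump, err = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # this learner's "strength"
        ensemble.append((alpha, stump))
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Combine the weak learners by a strength-weighted vote."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score > 0 else -1

# Toy data (made up): three genes; a sample is a tumor only when
# genes 0 and 1 both have a gene event. Gene 2 is irrelevant noise.
X = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [1 if a == 1 and b == 1 else -1 for a, b, c in X]

ensemble = boost(X, y, rounds=3)
for alpha, (gene, polarity) in ensemble:
    print(gene, polarity, round(alpha, 3))
```

On this toy data no single-gene rule classifies every sample correctly, but the weighted vote of three such rules does, which is the sense in which weak learners are combined into a strong one.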

Data is collected from both normal samples and cancerous tumor samples. A gene that plays a role in cancer development should behave differently in the two classes of samples. The algorithm tries to pick out these differences from the noisy background and, in addition, to show how the patterns differ.

For example, after the data is fed into the algorithm, the final report may look like this: "(Gene 1, Gene 2, Gene 3) is one group that causes the cancer; and the combination of Gene 1, Gene 4 and Gene 5 is another possible way." This means that if Gene 1, Gene 2 and Gene 3 all fail to function normally at the same time, a normal cell would become a cancer cell; likewise, if Gene 1, Gene 4 and Gene 5 all fail to function normally at the same time, cancer would be present.
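The way such a report would be read can be sketched in a few lines of code. This is only an illustration of the rule described above: the two patterns are the ones from the example report, while the function name and the three samples are made up.

```python
# The two gene-event patterns from the example report.
patterns = [
    {"Gene1", "Gene2", "Gene3"},
    {"Gene1", "Gene4", "Gene5"},
]

def matches_cancer_pattern(events, patterns):
    """True if every gene in at least one pattern has a gene event."""
    return any(pattern <= events for pattern in patterns)

# Hypothetical samples: the set of genes with a gene event in each.
sample_a = {"Gene1", "Gene2", "Gene3"}           # completes the first pattern
sample_b = {"Gene1", "Gene2", "Gene4"}           # completes neither pattern
sample_c = {"Gene1", "Gene4", "Gene5", "Gene7"}  # completes the second pattern

for name, events in [("a", sample_a), ("b", sample_b), ("c", sample_c)]:
    print(name, matches_cancer_pattern(events, patterns))
```

Sample b shows why the patterns matter: it has three gene events, including Gene 1, which appears in both patterns, yet it completes neither group.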

So far, I have assessed the performance of the algorithm on simulated data, and the results are very encouraging: I am able to recover the correct patterns of how genes affect cancer. The next step is to apply the algorithm to actual data, which will hopefully shed light on future studies in this field.