The
main assumption we are operating under is that the required information is
buried in chemists’ brains if only we can extract it. Therefore, the
initiative for the interaction must come from the system not the chemist, while
allowing the chemist the flexibility to supply additional information and to
modify the question sequence or content of the system. ...What we want to
design then is a question asking system [that] will gather rules about the
feasibility of the chemical molecules and their subgraphs being displayed.
[19]
In
short, Feigenbaum sought to emulate a gifted chemist with the computer. That
chemist was Carl Djerassi, named “El Supremo” by his graduate
students and post-docs. Djerassi’s astonishing achievements as a mass
spectrometrist relied on his abilities to feel his way through the process
without the aid of a complete theory, relying rather on experience, tacit
knowledge, hunches, and rules of thumb. In interviews, Feigenbaum elicited this
kind of information from Djerassi, in a process that heightened awareness of
the structure of the field for both participants. The process of involving a
computer in chemical research, in this way, organized a variety of kinds of
information in a crucial step toward theory.
A PARADIGM SHIFT IN BIOLOGY
Thus
far I have been considering efforts to predict structure from physical
principles as one of the paths through which computer science and
computer-based information technology began to reshape biology. The Holy Grail
of biology has always been the construction of a mathematized theoretical
biology, and for most molecular biologists the journey there has been directed
by the notion that the information for the three-dimensional folding and
structure of proteins is uniquely contained in the linear sequence of their
amino acids.
[20]
As we have seen, the molecular dynamics approach assumed that if all the forces
between atoms in a molecule, including bond energies and electrostatic
attraction and repulsion, are known, then it is possible to calculate the
three-dimensional arrangement of atoms that requires the least energy. Because
this method requires intensive computer calculations, shortcuts have been
developed that combine computer-intensive molecular dynamics computations,
artificial intelligence and interactive computer graphics in deriving protein
structure directly from chemical structure.
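The idea of calculating the arrangement of atoms that requires the least energy can be illustrated with a deliberately tiny sketch. It assumes a single pair of atoms interacting through a Lennard-Jones potential and uses plain gradient descent; the potential, the parameter values, and the function names are illustrative assumptions, not the force fields or programs used by the researchers discussed here.

```python
def lj_energy(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair energy at separation r (illustrative units)."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 ** 2 - ratio6)


def lj_gradient(r, epsilon=1.0, sigma=1.0):
    """Derivative of the pair energy with respect to the separation r."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (-12.0 * ratio6 ** 2 + 6.0 * ratio6) / r


def minimize_separation(r_start, step=0.01, iterations=2000):
    """Walk downhill on the energy curve until the separation settles."""
    r = r_start
    for _ in range(iterations):
        r -= step * lj_gradient(r)
    return r


# Hypothetical starting guess: atoms placed 1.5 sigma apart.
r_min = minimize_separation(1.5)
```

With sigma set to 1 the descent settles near r = 2^(1/6), roughly 1.12. A real protein involves thousands of coupled coordinates rather than one, which is why the text describes such calculations as computer-intensive.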
While
theoretically elegant, the determination of protein structure from chemical and
dynamical principles has been hobbled by difficulties. In the abstract,
analysis of physical data generated from protein crystals, such as x-ray and
nuclear magnetic resonance data, should offer rigorous ways to connect primary
amino acid sequences to 3D structure. But the problems of acquiring good
crystals and the difficulty of getting NMR data of sufficient resolution are
impediments to this approach. Moreover, while quantum mechanics provides a
solution to the protein folding problem in theory, the computational task of
predicting structure from first principles for large protein molecules
containing many thousands of atoms has proven impractical. Furthermore, unless
it is possible to grow large, well-ordered crystals of a given protein, x-ray
structure determination is not an option. The development of methods of
structure determination by high resolution 2-D NMR has alleviated this
situation somewhat, but this technique is also costly and time-consuming,
requiring large amounts of protein of high solubility and severely limited by
protein size. These difficulties have contributed to the slow rate of progress
in registering atomic coordinates of macromolecules.
An
indicator of the difficulty of pursuing this approach alone is suggested by the
growth of databanks of atomic coordinates for proteins. The Protein Data Bank
(PDB) was established in 1971 as a computer-based archival resource for
macromolecular structures. The purpose of the PDB was to collect, standardize,
and distribute atomic coordinates and other data from crystallographic
studies. In 1977 the PDB listed atomic coordinates for 47 macromolecules.
[21]
In 1987 that number began to increase rapidly at a rate of about 10 percent per
year due to the development of area detectors and widespread use of synchrotron
radiation, so that by April 1990 atomic coordinate entries existed for 535
macromolecules. Commenting on the state of the art in 1990, Holbrook et al.
noted that crystal determination could require one or more man-years.
[22]
Currently (1999) the Biological Macromolecule Crystallization Database (BMCD)
of the Protein Data Bank contains entries for 2526 biological macromolecules
for which diffraction quality crystals have been obtained. These include
proteins, protein:protein complexes, nucleic acid, nucleic acid:nucleic acid
complexes, protein:nucleic acid complexes, and viruses.
[23]
While
structure determination was moving at a snail’s pace, beginning in the
1970s another stream of work contributed to the transformation of biology as an
information science. The development of restriction enzymes, recombinant DNA
techniques, gene cloning techniques, and PCR was resulting in a flood of data
on DNA, RNA, and protein sequences. Indeed more than 140,000 genes were cloned
and sequenced in the twenty years from 1974 to 1994, of which more than 20
percent were human genes.
[24]
By the early 1990s, well before the beginning of the Human Genome Initiative,
the NIH GenBank database (release 70) contained more than 74,000 sequences,
while the Swiss Protein database (Swiss-Prot) included nearly 23,000 sequences.
Protein databases were doubling in size every 12 months, and some were
predicting that, as a result of the technological impact of the Human Genome
Initiative, ten million base pairs a day would be sequenced by the year 2000.
Such an explosion of data encouraged the development of a second approach to
determining function and structure of protein sequences: namely, prediction
from sequence data alone. This “bioinformatics” approach identifies
the function and structure of unknown proteins by applying search algorithms to
existing protein libraries in order to determine sequence similarity,
percentages of matching residues, and the statistical significance of each
database sequence.
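The simplest of the measures named here, the percentage of matching residues, can be sketched as follows. This is a minimal illustration under stated assumptions: the sequences are taken to be already aligned to equal length, the toy library and all names are hypothetical, and the statistical-significance step that real search programs add is omitted.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions at which both sequences carry the same residue."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    if not seq_a:
        return 0.0
    # A gap character ("-") never counts as a match.
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != "-")
    return 100.0 * matches / len(seq_a)


def rank_library(query, library):
    """Score every library entry against the query, best match first."""
    scored = [(name, percent_identity(query, seq)) for name, seq in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy library; real protein libraries hold many thousands of entries.
toy_library = {
    "protein_A": "MKTAYIAK",
    "protein_B": "MKTAYLAK",
    "protein_C": "GGSSPLQW",
}
ranking = rank_library("MKTAYIAK", toy_library)
```

An unknown protein would be assigned the likely function or structure of whichever library entries rank highest, provided the match is also statistically significant.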
A
key development illustrating the ways in which computer science and molecular
biology began to merge in the formation of bioinformatics was the MOLGEN
project at Stanford, together with the events related to the formation and
subsequent development of BIONET. MOLGEN was a continuation of the projects in
artificial intelligence and knowledge engineering begun at Stanford with
DENDRAL. MOLGEN was started in 1975 as a project in the Heuristic Programming
Project with Edward Feigenbaum as principal investigator directing the thesis
projects of Mark Stefik and Peter Friedland.
[25]
The aim of MOLGEN was to model the experimental design activity of scientists
in molecular genetics.
[26]
Before an experimentalist sets out to achieve some goal, he produces a working
outline of the experiment, guiding each step of the process. The central idea of
MOLGEN was that in designing a new experiment scientists rarely plan from
scratch. Instead they find a skeletal plan, an overall design that has worked
for a related or more abstract problem, and then adapt it to the particular
experimental context. Like DENDRAL, this approach is heavily dependent upon
large amounts of domain-specific knowledge in the field of molecular biology
and especially upon good heuristics for choosing among alternative
implementations.
MOLGEN’s
designers chose molecular biology as appropriate for the application of
artificial intelligence because the techniques and instrumentation generated in
the 1970s seemed ripe for automation. The advent of rapid DNA cloning and
sequencing methods had had an explosive effect on the amount of data that could
be most readily represented and analyzed by a computer. Moreover, it appeared
that very soon progress in analyzing information in DNA sequences would be
limited by the lack of an appropriate combination of the available search and
statistical tools. MOLGEN was intended to apply rules to detect profitable
directions of analysis and to reject unpromising ones.
[27]
Peter
Friedland was responsible for constructing the knowledge-base component of
MOLGEN,
and though not himself a molecular biologist, he made a major contribution to
the field by assembling the rules and techniques of molecular biology into an
interactive, computerized system of analytical programs. Friedland worked with
Stanford molecular biologists Douglas Brutlag, Laurence Kedes, John Sninsky,
and Rosalind Grymes, who provided expert knowledge on enzymatic methods,
nucleic acid structures, detection methods, and pointers to key references in
all areas of molecular biology. Along with providing an effective encyclopedia
of information about technique selection in planning a laboratory experiment,
the knowledge base contained a number of tools for automated sequence analysis.
Brutlag, Kedes, Sninsky, and Grymes were interested in having a battery of
automated tools for sequence analysis, and they contracted with Friedland and
Stefik—both gifted computer program designers—to build them in
exchange for contributing their expert knowledge to the project.
[28]
This collaboration of computer scientists and molecular biologists helped
biology along the road to becoming an information science.
Among
the programs Friedland and Stefik created for
MOLGEN
was
SEQ,
an interactive self-documenting program for nucleic acid sequence analysis
which had 13 different procedures with over 25 different sub-procedures, many
of which could be invoked simultaneously to provide different analytical
methods for any sequence of interest.
SEQ
brought together in a single program methods for primary sequence analysis
described in the literature by Korn et al., Staden, and numerous others.
[29]
SEQ also performed homology searches at a specified degree of homology, as well
as dyad symmetry (inverted repeat) searches on DNA sequences.
[30]
Another feature of
SEQ
was its ability to prepare restriction maps with the names and locations of the
restriction sites marked on the nucleotide sequence in addition to having a
facility for calculating the length of DNA fragments from restriction digests
of any known sequence. Another program in the
MOLGEN
suite
was
GA1
(later called MAP). Constructed by Stefik,
GA1
was an artificial intelligence program that allowed generation of restriction
enzyme maps of DNA structures from segmentation data.
[31]
It would construct and evaluate all logical alternative models that fit the
data and rank them in relative order of fit. A further program in
MOLGEN
was
SAFE,
which aided in enzyme selection for gene excision.
SAFE
took amino acid sequence data and predicted the restriction enzymes guaranteed
not to cut within the gene.
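Two of the SEQ features described above, marking restriction sites on a sequence and calculating fragment lengths from a digest of a known sequence, can be sketched in a few lines. This is a hypothetical illustration rather than SEQ's own code: the function names and the test sequence are invented, only linear molecules and a single enzyme are handled, and only one strand is scanned, which suffices for a palindromic site such as EcoRI's GAATTC (cut after the first G).

```python
def find_sites(sequence, site):
    """Return the 0-based start position of every occurrence of `site`."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)  # allow overlapping hits
    return positions


def digest_linear(sequence, site, cut_offset):
    """Fragment lengths after cutting a linear molecule at every site.

    `cut_offset` is the number of bases from the start of the recognition
    site to the cut on the scanned strand (1 for EcoRI: G^AATTC).
    """
    cuts = [pos + cut_offset for pos in find_sites(sequence, site)]
    bounds = [0] + cuts + [len(sequence)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]


# Invented 20-base test sequence containing two EcoRI sites.
dna = "AA" + "GAATTC" + "TTTT" + "GAATTC" + "CC"
site_positions = find_sites(dna, "GAATTC")
fragments = digest_linear(dna, "GAATTC", cut_offset=1)
```

The fragment lengths always sum to the length of the input sequence, which gives a quick sanity check on any digest computed this way. GA1 addressed the harder inverse problem of inferring the map from the fragment data.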
In
its first phase of development (1977-1980)
MOLGEN
consisted
of the programs described above and a knowledge base containing information on
about 300 laboratory methods and 30 strategies for using them. It also
contained the best currently available data on about 40 common phages,
plasmids, genes, and other known nucleic acid structures. The second phase of
development beginning in 1980 scaled up both analytical tools and knowledge
base. Perhaps the most significant aspect of the second phase was making
MOLGEN
available
to the scientific community at large on the Stanford University Medical
Experimental national computer resource, SUMEX-AIM. SUMEX-AIM, supported by the
Biotechnology Resources Program at NIH since 1974, had been home to
DENDRAL
and
several other programs. The new experimental resource on SUMEX, comprising the
MOLGEN
programs
and access to all major genetic databases, was called
GENET.
In February 1980
GENET
was
made available to a carefully limited community of users.
[32]
MOLGEN and GENET were immediate successes with the molecular biology community.
In their first few months of operation in 1979 more than 200 labs (with several
users in each of those labs) accessed the system. By November 1, 1982, more
than 300 labs at 100 institutions were accessing the system around the clock.
[33]
Traffic on the site was so heavy that restrictions had to be implemented and
plans for expansion considered. In addition to the academic users a number of
biotech firms, such as Monsanto, Genentech, Cetus, and Chiron, used the system
heavily. In order to ensure that the academic community had unrestricted access
to the SUMEX computer and that the NIH would be satisfied commercial users were
not getting unfair access to the resource, Feigenbaum, principal investigator
in charge of the SUMEX resource, and Thomas Rindfleisch, facility manager,
decided to exclude commercial users.
[34]
To
provide commercial users with their own unrestricted access to
GENET
and
MOLGEN
programs,
Brutlag, Feigenbaum, Friedland, and Kedes formed a company, IntelliGenetics,
which would offer the suite of
MOLGEN
software
for sale or rental to the emerging biotechnology industry. With 125 research
labs doing recombinant DNA research in the US alone and a number of new genetic
engineering firms starting up, opportunities looked outstanding. No one was
currently supplying software in this rapidly growing genetic engineering
marketplace. With their exclusive licensing arrangement with Stanford for the
MOLGEN
software,
IntelliGenetics was poised to lead a huge growth area. The business plan
expressed well the excellent position of the company:
A
major key to the success of IntelliGenetics will be the fact that the
recombinant DNA research revolution is so recent. While every potential
customer is well capitalized, few have the manpower they say they need; this
year several firms are hiring 50 molecular geneticist Ph.D.s, and one company
speaks of 1000 within five years. These firms require computerized
assistance—for the storage and analysis of very large amounts of DNA
sequence information which is growing at an exponential rate—and will
continue to do so for the foreseeable future (10 years). Access to this
information and the ability to perform rapid and efficient pattern recognition
among these sequences is currently being demanded by most of the firms involved
in recombinant DNA research.
The
programs offered by IntelliGenetics will enable the researchers to perform
tasks that are: 1) virtually impossible to perform with hand calculations, and
2) extremely time-consuming and prone to human error.
In
other words, IntelliGenetics offers researcher productivity improvement to an
industry with expanding demand for more researchers which is experiencing a
severe supply shortage.
[35]
The
resource that IntelliGenetics eventually offered to commercial users was
BIONET.
Like
GENET,
its prototype,
BIONET
combined in one computer site databases of DNA sequences with programs to aid
in their analysis.
Prior
to the startup of
BIONET,
GENET
was
not the only resource for DNA sequences. Several researchers were making their
databases available. Margaret Dayhoff had created a database of DNA sequences
and some software for sequence analysis for the National Biomedical Research
Foundation that was marketed commercially. Walter Goad, a physicist at Los
Alamos National Laboratory, collected DNA sequences from the published
literature and made them freely available to researchers. But by the late 1970s
the number of bases sequenced was already approaching 3 million and expected to
double soon. Some form of easy communication between labs and effective data
handling was considered a major priority in the biological community. While
experiments were going on with GENET, a number of nationally prominent
molecular biologists had been pressing to start an NIH-sponsored central
repository for DNA sequences. An early meeting
organized by Joshua Lederberg was held in 1979 at Rockefeller University. The
proposed NIH initiative was originally supposed to be coordinated with a
similar effort at the European Molecular Biology Laboratory (EMBL) in
Heidelberg, but the Europeans became dissatisfied with the lack of progress on
the American end and decided to go ahead with their own databank. EMBL
announced the availability of its Nucleotide Sequence Data Library in April
1982, several months before the American project was funded. Finally, in
August, 1982 the NIH awarded a contract for $3 million over 5 years to the
Boston-based firm of Bolt, Beranek, and Newman (BB&N) to set up the
national database known as GenBank in collaboration with Los Alamos National
Laboratory. IntelliGenetics submitted an unsuccessful bid for that contract.
The
discussions leading up to GenBank included consideration of funding a more
ambitious databank, known as “Project 2,” which was to provide a
national center for the computer analysis of DNA sequences. Budget cuts forced
the NIH to abandon that scheme.
[36]
However, they returned to it the following year thanks to the persistence of
IntelliGenetics representatives. Although GenBank launched a formal national
DNA sequence collection effort, the need for computational facilities voiced by
molecular biologists was still left unanswered. In September 1983, after a
review process that took over a year, the NIH Division of Research Resources
awarded IntelliGenetics a $5.6 million five-year contract to establish
BIONET.[37]
The contract started on March 1, 1984 and ended on February 27, 1989.
BIONET
first became available to the research community in November 1984. The fee for
use was $400 per year per laboratory, and remained at that level throughout its
first five years.
BIONET’s
use grew impressively. Initially the IntelliGenetics team set the target for
user subscriptions at 250 labs. However, the annual report for the first
year’s activities of BIONET in March 1985 listed 350 labs with 1132 users. By
August 1985 that number had increased dramatically to 450 labs and 1500 users.
[38]
In April 1986 BIONET had 464 laboratories comprising 1589 users. By October
1986 the numbers were 495 labs and 1716 users.
[39]
By 1989, 900 laboratories in the U.S., Canada, Europe, and Japan (comprising
about 2800 researchers) subscribed to BIONET, and 20 to 40 new laboratories
joined each month.
[40]
BIONET
was intended to establish a national computer resource for molecular biology
satisfying three goals. A first goal was to provide a way for academic
biologists to obtain access to computational tools to facilitate their nucleic
acid (and possibly protein) related research. In addition to giving researchers
ready access to national databases on DNA and protein sequences,
BIONET
would provide a library of sophisticated software for sequence searching,
matching, and manipulation. A second goal was to provide a mechanism to
facilitate research into improving such tools. The BIONET contract supported
research and development of further software, both through in-house research
by IntelliGenetics scientists and through collaborative ventures with outside
researchers. A third goal of BIONET was to enhance scientific productivity
through electronic communications.
The
stimulation of collaborative work through electronic communication was perhaps
the most impressive achievement of
BIONET.
BIONET was much more than the Stanford GENET plus the MOLGEN-IntelliGenetics
suite of software. Whereas GENET with its pair of ports could accommodate only
two users at any one time, BIONET had 22 ports providing an estimated annual
30,000 connect hours.
[41]
All subscribers to
BIONET
were provided with email accounts. For most molecular biologists this was
something entirely new, since most university labs were just beginning to be
connected with regular email service. At least 20 different bulletin boards on
numerous topics were supported by
BIONET.
In an effort to change the culture of molecular biologists by accustoming them
to the use of electronic communications and more collaborative work,
BIONET
users were required to join one of the bulletin board groups.
BIONET
subscribers had access to the latest versions of the most important databases
for molecular biology. Large databases available at BIONET were (i) GenBank™,
the National Institutes of Health DNA sequence library; (ii) EMBL, the European
Molecular Biology Laboratory nucleotide sequence library; (iii) NBRF-PIR, the
National Biomedical Research Foundation's protein sequence database (this
database is part of the Protein Identification Resource [PIR] supported by
NIH's Division of Research Resources); (iv) SWISS-PROT, a protein sequence
database founded by Amos Bairoch of the University of Geneva and subsequently
managed and distributed by the European Molecular Biology Laboratory; (v)
VectorBank™, IntelliGenetics' database of cloning vector restriction maps and
sequences;
(vi) Restriction Enzyme Library, a complete list of restriction enzymes and
cutting sites provided by Richard Roberts at Cold Spring Harbor; and (vii)
Keybank, IntelliGenetics' collection of predefined patterns or "keys" for
database searching. Several smaller databases were also available, including a
directory of molecular biology databases, a collection of literature references
to sequence analysis papers, and a complete set of detailed molecular
biological laboratory protocols (especially for E. coli and yeast work).
[42]
Perhaps
the most important contribution made by
BIONET
to establishing molecular biology as an information science was negotiated at
the renewal of GenBank. As described above, BB&N was awarded the first
5-year contract to manage GenBank. The contract was up for renewal in 1987, and
given its track record in managing
BIONET,
IntelliGenetics submitted a proposal to manage GenBank. GenBank users had
become dissatisfied with the serious delay in sequence data publication.
GenBank was 2 years behind in disseminating sequence data it had received.
[43]
At a meeting in Los Alamos in 1986, Walter Goad noted that GenBank had 12
million base pairs. Other sequence collections available to researchers
contained 14-15 million base pairs, so that GenBank was roughly 14-20% out of
date.
[44]
Concerned that researchers would turn to other, more up-to-date data sources,
the NIH listed this delay as one of the issues it wanted IntelliGenetics to
address in its proposal to manage GenBank.
[45]
IntelliGenetics
proposed to solve this problem by automating the submission of gene and protein
sequences. Instead of laboriously searching the published scientific literature
for sequence data, rekeying them into a GenBank standard electronic format, and
checking them for accuracy, which was the standard method employed at that
time, IntelliGenetics would automate the submission procedure with an online
submission program, XGENPUB (later called “AUTHORIN”).
In
fact, IntelliGenetics was already progressing toward automating all levels of
sequence entry and (as much as possible) analysis. As early as 1986
IntelliGenetics included SEQIN in PC/GENE, its commercial software package
designed for microcomputers. SEQIN was designed for entering and editing
nucleic acid sequences, and it already had the functionality needed to deposit
sequences with GenBank or EMBL electronically.
[46]
Transferring this program to the mainframe was a straightforward move. Indeed
the online entry of original sequence data was already a feature of
BIONET,
since large numbers of researchers were using the IntelliGenetics
GEL
program
on the
BIONET
computer.
GEL
was
a program that accepted and analyzed data produced by all the popular
sequencing methods. It provided comprehensive record-keeping and analysis for
an entire sequencing project from start to finish. The final product of the
GEL
program
was a sequence file suitable for analysis by other programs such as
SEQ.
XGENPUB
added
a natural extension to this capability by allowing the scientist to annotate a
sequence according to the standard GenBank format and mail the sequence and its
annotation to GenBank electronically. The interface was a forms-oriented
display editor that would automatically insert the sequence in the appropriate
place in the form by copying the sequence from a designated file on the
BIONET
computer. When completed, the form could be forwarded to the GenBank computer
at Los Alamos (GenBank being the National Institutes of Health DNA sequence
library); to EMBL, the nucleotide sequence database of the European Molecular
Biology Laboratory; and to NBRF-PIR, the National Biomedical Research
Foundation's protein sequence database.
[47]
Creating
a new culture requires both the carrot and the stick. Making the online
programs available and easy to use was one thing. Getting all molecular
biologists to use them was another. In order to doubly encourage molecular
biologists to comply with the new procedure of submitting their data online,
the major molecular biology journals agreed to require evidence that the data
had been submitted before they would consider a manuscript for review.
Nucleic Acids Research was the first journal to enforce this transition to
electronic data submission.
[48]
With these new policies and networks in place,
BIONET
was able to reduce the time from submission to publication and distribution of
new sequence data from two years to 24 hours. As noted above, just a few years
earlier, at the beginning of
BIONET,
there were only 10 million base pairs published, and these had been the result
of several years’ effort. The new electronic submission of data generated
10 million base pairs a month.
[49]
Walter Gilbert may have angered some of his colleagues at the 1987 Los Alamos
Workshop on Automation in Decoding the Human Genome when he stated that,
“Sequencing the human genome is not science, it is production.”
[50]
But he surely had his finger on the pulse of the new biology.
THE MATRIX OF BIOLOGY
The
explosion of data on all levels of the biological continuum made possible by
the new biotechnologies and represented powerfully by organizations such as
BIONET
was a source of both exhilaration and anxiety. Of primary concern to many
biologists was how best to organize this massive outpouring of data in a way
that would lead to deeper theoretical insight, perhaps even a unified
theoretical perspective for biology. The National Institutes of Health were
among those most concerned about these issues, and they organized a series of
workshops to consider the new perspectives emerging from recent developments.
The meetings culminated in a report, prepared by a committee chaired by Harold
Morowitz, entitled Models for Biomedical Research: A New Perspective (1985).
The panelists foresaw the emergence of a new theoretical biology
“different from theoretical physics, which consists of a small number of
postulates and the procedures and apparatus for deriving predictions from those
postulates.” The new biology was far more than just a collection of
experimental observations. Rather it was a vast array of information gaining
coherence through organization into a conceptual matrix.
[51]
A point in the history of biology had been reached where new generalizations
and higher order biological laws were being approached but obscured by the
simple mass of data and volume of literature. To move toward this new
theoretical biology the committee proposed a multi-dimensional matrix of
biological knowledge:
That is the complete data base of published biological experiments structured by the
laws, empirical generalizations, and physical foundations of biology and
connected by all the interspecific transfers of information. The matrix
includes but is more than the computerized data base of biological literature,
since the search methods and key words used in gaining access to that base are
themselves related to the generalizations and ideas about the structure of
biological knowledge.
[52]
New
disciplinary requirements were imposed on the biologist who wanted to interpret
and use the matrix of biological knowledge:
The
development of the matrix and the extraction of biological generalizations from
it are going to require a new kind of scientist, a person familiar enough with
the subject being studied to read the literature critically, yet expert enough
in information science to be innovative in developing methods of classification
and search. This implies the development of a new kind of theory geared
explicitly to biology with its particular theory structure. It will be tied to
the use of computers, which will be required to deal with the vast amount and
complexity of the information, but it will be designed to search for general
laws and structures that will make general biology much more easily accessible
to the biomedical scientist.
[53]
Similar
concerns about managing the explosion of new information motivated the Board of
Regents of the National Library of Medicine. In its Long Range Plan of 1987 the
NLM drew directly on the notion of the matrix of biological knowledge and
elaborated upon it explicitly in terms of fashioning the new biology as an
information science.
[54]
The Long Range Plan contained a series of recommendations that were the outcome
of studies done by five different panels, including a panel that considered
issues connected with building factual databases, such as sequence databases.
In
the view of the panel, the field of molecular biology is opening the door to an
era of unprecedented understanding and control of life processes, including
“automated methods now available to analyze and modify biologically
important macromolecules.”
[55]
The report characterized biomedical databases as representing the universal
hierarchy of biological nature: cells, chromosomes, genes, proteins. Factual
databases were being developed at all levels of the hierarchy from cells to
base-pair sequences. Because of the complexity of biological systems, basic
research in the life sciences is increasingly dependent on automated tools to
store and manipulate the large bodies of data describing the structure and
function of important macromolecules. But, the NLM Long Range Plan stated,
although the critical questions being asked can often only be answered by
relating one biological level to another, methods for automatically suggesting
links across levels are non-existent.
[56]
A
singular and immediate window of opportunity exists for the Library in the area
of molecular biology information. Because of new automated laboratory methods,
genetic and biochemical data are accumulating far faster than they can be
assimilated into the scientific literature. The problems of scientific research
in biotechnology are increasingly problems of information science. By applying
its expertise in computer technologies to the work of understanding the
structure and function of living cells on a molecular level, NLM can assist and
hasten the Nation’s entry into a remarkable new age of knowledge in the
biological sciences.
[57]
To
support and promote the entry into the new age of biological knowledge the NIH
recommended building a National Center for Biotechnology Information to serve
as a repository and distribution center for this growing body of knowledge and
as a laboratory for developing new information analysis and communications
tools essential to the advance of the field. The proposal recommended $12.75
million per year for 1988-1990, with an additional $10 million per year for
work in medical informatics.
[58]
The program would emphasize collaboration between computer and information
scientists and the biomedical researcher. In addition the NIH would support
research in the areas of molecular biology database representation,
retrieval-linkages, and modeling systems, while examining interfaces based on
algorithms, graphics and expert systems. The recommendation also called for the
construction of online data delivery through linked regional centers and
distributed database subsets.
BRAVE NEW THEORY
Two
different styles of work have characterized the field of molecular biology. The
biophysical approach has sought to predict the function of a molecule from its
structure. The biochemical approach, on the other hand, has been concerned with
predicting phenotype from biochemical function. If there has been a unifying
framework for the field, at least from its early days up through the 1980s, it
was provided by the “central dogma” emerging from the work of
Watson, Crick, Monod, and Jacob in the late 1960s, schematized as follows:
DNA → RNA → Protein → Function
In this paper I have singled out molecular biologists whose Holy Grail has always
been to construct a mathematized, predictive biological theory. In terms of the
“central dogma” the measure of success in the enterprise of making
biology predictive would be—and has been since the days of Claude
Bernard—rational medicine. If one had a complete grasp of all the levels
from DNA to behavioral function including the processes of translation at each
level, then one could target specific proteins or biochemical processes that
may be malfunctioning and design drugs specifically to repair these disorders.
For molecular biologists with high theory ambitions the preferred path toward
achieving this goal has been based on the notion that the function of a
molecule is determined by its three-dimensional folding and that the structure
of proteins is uniquely contained in the linear sequence of their amino acids.
[59]
But determination of protein structure and function is only part of the problem
confronting a theoretical biology. A fully-fledged theoretical biology would
want to be able to determine the biochemical function of the protein structure
as well as its expected behavioral contribution within the organism. Thus
biochemists have resisted the road of high theory and have pursued a solidly
experimental approach aimed at eliciting common models of biochemical function
across a range of mid-level biological structures from proteins and enzymes
through cells. Their approach has been to identify a gene by some direct
experimental procedure (determined by some property of its product or
otherwise related to its phenotype), to clone it, to sequence it, to make
its product and to continue to work experimentally so as to seek an
understanding of its function. This model, as Walter Gilbert has observed, was
suited to “small science,” experimental science conducted in a
single lab.
[60]
The emergence of organizations like the Brookhaven Protein Data Bank in 1971,
GenBank in 1982, and BIONET in 1984, and the massive amount of sequencing data
that began to become
available in university and company databases, and more recently publicly
through the Human Genome Initiative, have complicated this picture immensely
through an unprecedented influx of new data. In the process a paradigm shift
has occurred in both the intellectual and institutional structures of biology.
According to some of the central players in this transformation, at the core is
biology’s switch from having been an observational science, limited
primarily by the ability to make observations, to being a data-bound science
limited by its practitioners’ ability to understand large amounts of
information derived from observations. To understand the data the tools of
information science have not only become necessary handmaidens to theory: they
have also fundamentally changed the picture of biological theory itself. A new
picture of theory radically different from even the biophysicists’ model
of theory has come into view. Disciplinarily, biology has become an information
science. Institutionally it is becoming “big science.” Gilbert
characterizes the situation sharply:
To use this flood of knowledge, which will pour across the computer networks of
the world, biologists not only must become computer-literate, but also change
their approach to the problem of understanding life.
The next tenfold increase in the amount of information in the databases will divide
the world into haves and have-nots, unless each of us connects to that
information and learns how to sift through it for the parts we need.
[61] The
new data-bound biology Gilbert hints at in this scenario is genomics, the
theoretical component of which might be termed “computational
biology,” while its instrumental and experimental component might be
considered as “bioinformatics.” The fundamental dogma of this new
biology, as characterized by Douglas Brutlag, reformulates the central dogma of
Jacob-Monod in terms of “information flow”:
[62]
Genetic information → Molecular structure → Biochemical function → Biologic behavior
Walter Gilbert describes the newly forming genomic view of biology:
The new paradigm now emerging is that all the “genes” will be known (in
the sense of being resident in databases available electronically), and that
the starting point of a biological investigation will be theoretical. An
individual scientist will begin with a theoretical conjecture, only then
turning to experiment to follow or test that hypothesis. The actual biology
will continue to be done as “small science”—depending on
individual insight and inspiration to produce new knowledge—but the
reagents that the scientist uses will include a knowledge of the primary
sequence of the organism, together with a list of all previous deductions from
that sequence.
[63] Genomics, computational biology, and bioinformatics restructure the playing
field of biology, bringing a substantially modified toolkit to the repertoire of
molecular biology skills developed in the 1970s. Along with the biochemistry
components new skills are now required, including machine learning, robotics,
databases, statistics and probability, artificial intelligence, information
theory, algorithms, and graph theory.
[64] Proclamations
of the sort made by Gilbert and other promoters of genomics may seem like
hyperbole. But the Human Genome Initiative and the information technology that
enables it have changed molecular biology in fundamental ways and, indeed, may
suggest similar changes in store for other domains of science. The online DNA
and protein databases I have described have not just been repositories of
information for insertion into the routine work of molecular biology, and the
software programs discussed in connection with IntelliGenetics and GenBank are
more than retrieval aids for transporting that information back to the lab. As
a set of final reflections, I want to look in more detail at some ways this
software has been used to address the problems of molecular biology in order to
gain a sense of the changes taking place.
BIOLOGY IN SILICO
To appreciate the relationship between genomics and earlier work in molecular
biology it is useful to compare approaches to the determination of structure
and function. Rather than an approach deriving structure and function from
first principles of the dynamics of protein folding, the bioinformatics
approach involves comparing new sequences with preexisting ones and discovering
structure and function by homology to known structures. This approach examines
the kinds of amino acid sequences or patterns of amino acids found in each of
the known protein structures. The sequences of proteins whose structures have
already been determined and are already on file in the PDB are examined to
infer rules or patterns applicable to novel protein sequences to predict their
structure. For instance, certain amino acids, such as leucine and alanine, are
very common in α-helical regions of proteins, whereas other amino acids,
such as proline, are rarely if ever found in α-helices. Using patterns of
amino acids or rules based on these patterns, the genome scientist can attempt
to predict where helical regions will occur in proteins whose structure is
unknown and for which a complete sequence exists. Clearly the lineage of this
approach is the work on automated learning first begun in DENDRAL and carried
forward in other AI projects related to molecular biology such as MOLGEN.
The great challenge in the study of protein structure has been to predict the fold
of a protein segment from its amino acid sequence. Before the advent of
sequencing technology it was generally assumed that every unique protein
sequence would produce a three-dimensional structure radically different from
every other protein. But the new technology revealed that protein sequences are
highly redundant: only a small percentage of the total sequence is crucial to
the structure and function of the protein. Moreover, while similar protein
sequences generally indicate similar folded conformations and functions, the
converse does not hold. In some proteins, such as the nucleotide-binding
proteins, the structural features encoding a common function are conserved
while primary sequence similarity is almost non-existent.
[65]
Methods that detect similarities solely at the primary sequence level turned
out to have difficulty addressing functional associations in such sequences. A
number of features often only implicit in the protein’s primary sequence
of amino acids turned out to be important in determining structure and function.
Such findings implied the need for more sophisticated techniques of searching than
simply finding identical matches between sequences in order to elicit
information about similarities between higher ordered structures such as folds.
One solution adopted early on by programs such as SEQ was
to assume that if two DNA segments are evolutionarily related, their sequences
will probably be related in structure and function. The related descendants are
identifiable as homologues. For instance, there are more than 650 globin
sequences in the protein sequence databases, all of them very similar in
structure. These sequences are assumed to be related by evolutionary descent
rather than having been created de novo. Many programs for searching sequence
databases have been written, including an important early method written in
1970 by Needleman and Wunsch and incorporated into SEQ for
aligning sequences based on homologies.
[66]
The method of homology depends upon assumptions related to the genetic events
that could have occurred in the divergent (or convergent) evolution of
proteins; namely, that homologous proteins are the result of gene duplication
and subsequent mutations. If one assumes that following the duplication, point
mutations occur at a constant or variable rate, but randomly along the genes of
the two proteins, then after a relatively short period of time the protein
pairs will have nearly identical sequences. Later there will be gaps in the
shared sets of base-pairs between the two proteins. Needleman and Wunsch
determined the degree of homology between protein pairs by counting the number
of non-identical pairs (amino acid replacements) in the homologous comparison
and using this number as a measure of evolutionary distance between the amino
acid sequences. A second approach was to count the minimum number of mutations
represented by the non-identical pairs.
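The distance measure just described can be made concrete with a minimal sketch in Python. It aligns two sequences globally by dynamic programming, in the manner of the Needleman and Wunsch method, and then counts the non-identical aligned pairs. The function names and the match, mismatch, and gap values are illustrative assumptions; they are not the scoring scheme of the 1970 paper or of SEQ.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b; scoring values are assumed, not historical."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align the two residues
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


def count_replacements(aligned_a, aligned_b):
    """Count aligned positions holding two different residues (gaps are ignored)."""
    return sum(1 for x, y in zip(aligned_a, aligned_b)
               if x != "-" and y != "-" and x != y)


aligned_a, aligned_b, total = needleman_wunsch("ACGT", "AGGT")
print(aligned_a, aligned_b, total, count_replacements(aligned_a, aligned_b))
```

A larger replacement count stands in for a greater evolutionary distance between the two sequences; counting the minimum number of mutations instead would require a table of codon changes, which this sketch does not attempt.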
Another example of a key tool used in determining structure-function relationships is a
search for sequences that correspond to small conserved regions of proteins,
modular structures known as motifs.
[67]
Several different kinds of motifs are related to secondary and tertiary
structure. Protein scientists distinguish among four hierarchical levels of
structure. Primary structure is the specific linear sequence of the 20 possible
amino acids making up the building blocks of the protein. Secondary structure
consists of patterns of repeating polypeptide structure such as the α-helix,
the β-sheet, and reverse turns. Supersecondary structure refers to a few
common motifs of interconnected elements of secondary structure. Segments of
α-helix and β-strand often combine in specific structural motifs. One
example is the α-helix-turn-helix motif found in DNA-binding proteins.
This motif is 22 amino acids in length and enables the protein to bind to DNA.
Another motif at the supersecondary level is known as the Rossmann fold, in
which three α-helices alternate with three parallel β-strands. This
has turned out to be a general fold for binding mono- or dinucleotides, and is
the most common fold observed in globular proteins.
[68]
A higher order of modular structure is found at the tertiary level. Tertiary
structure is the overall spatial arrangement of the polypeptide chain into a
globular mass of hydrophobic side chains forming the central core, from which
water is excluded, and more polar side chains favoring the solvent-exposed
surface. Within tertiary structures are certain domains on the order of 100
amino acids, which are themselves structural motifs. Domain motifs have been
shown to be encoded by exons, individual DNA sequences that are directly
translated into peptide sequences. Assuming that all contemporary proteins have
been derived from a small number of original ones, Walter Gilbert, et al. have
argued that the global number of exons from which all existing protein domains
have been derived is somewhere between 1,000 and 7,000.
[69]
Motifs are powerful tools for searching databases of known structure and function to
determine the structure and function of an unknown gene or protein. The motif
can serve as a kind of probe for searching the database or some new sequence,
testing for the presence of that motif. The PROSITE database, for example, has
more than 1000 of these motifs.
[70]
With such a library of motifs one can take a new sequence and use each one of
the motifs to get clues to its structure. Suppose the sequence of a protein has
been determined. The most common way to examine a new gene or protein for its
biologic function is simply to compare its sequence with all known DNA or
protein sequences in the databases and note any strong similarities. The
particular gene or protein that has just been determined will of course not be
found in the databases, but a homologue from another organism or a gene or
protein having a related function may be found.
[71]
In either case the evolutionary similarity implies a common ancestor and hence
a common function. Searching with motif probes refines the determination of the
fold regions of the protein. These methods become more and more successful as
the databases grow larger and as the sensitivity of the search procedure
increases.
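The use of a motif as a probe can be illustrated with a short Python sketch. It assumes the motif is written in a PROSITE-style notation (elements joined by hyphens, `x` for any residue, brackets for alternatives, braces for exclusions, a count in parentheses) and translates it into a regular expression. The helper names are hypothetical, and the pattern used is offered only as an illustrative nucleotide-binding pattern, not as an entry quoted from any database.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a compiled regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Split an element such as "x(2,4)" into its core and optional repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        # Bracketed alternatives and single letters are already valid regex.
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return re.compile("".join(parts))


def find_motif(sequence, pattern):
    """Return (start index, matched text) for each non-overlapping hit."""
    regex = prosite_to_regex(pattern)
    return [(hit.start(), hit.group()) for hit in regex.finditer(sequence)]


# Illustrative probe: G or A, any four residues, then G, K, and S or T.
print(find_motif("MMGPPPPGKSMM", "[AG]-x(4)-G-K-[ST]"))
```

Running every motif in a library against a new sequence in this way yields the list of clues described above; the match is still all-or-nothing at each position.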
The all-or-nothing character of consensus sequences (a sequence either matches
or it doesn’t) led researchers to modify this technique to introduce
degrees of similarity among aligned sequences as a way of detecting
similarities between proteins, even distantly related ones. Knowing the
function of a protein in some genome, such as E. coli, for instance, might
suggest the same function of a closely related protein in
an animal or human genome.
[72]
Moreover, as noted above, different amino acids can fit the same pattern, such
as the helix-turn-helix, so that a representation of sequence pattern that
accepts alternate amino acids at a position, and that allows regions in which a
variable number of amino acids may occur, is a desirable way of extending the
power of straightforward consensus sequence comparison. One such technique is to use
weights or frequencies to specify greater tolerance in some positions than in
others. An illustration of the success of this approach is provided by the
DNA-binding proteins mentioned above, which contain a helix-turn-helix motif 22
amino acids in length.
[73]
Comparison of the linear amino acid sequences of these proteins revealed no
consensus sequence that could distinguish them from any other protein. A weight
matrix is constructed by determining the frequency with which each amino acid
appears at each position and then converting these numbers to a measure of the
probability of occurrence of each amino acid. This weight matrix can be applied
to measure the likelihood that any given sequence 22 amino acids long is
related to the helix-turn-helix family. A further modification of the weight
matrix is the profile, which allows one to estimate the probability that any
amino acid will appear in a specific position.
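A minimal sketch in Python may help fix the idea of a weight matrix. It counts how often each amino acid occurs at each position of a set of aligned segments, converts the counts to log-odds weights against a uniform background, and scores a candidate segment by summing its weights. The pseudocount, the uniform background, the function names, and the four-residue toy family are all assumptions made for illustration; a real helix-turn-helix matrix would span 22 positions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_weight_matrix(aligned, pseudocount=1.0):
    """Return one dict per position mapping amino acid -> log-odds weight."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned segments must have the same length")
    background = 1.0 / len(AMINO_ACIDS)   # assumed uniform background frequency
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned)
        # The pseudocount keeps unseen residues from receiving a weight of -infinity.
        matrix.append({aa: math.log((counts[aa] + pseudocount) / total / background)
                       for aa in AMINO_ACIDS})
    return matrix


def score_segment(matrix, segment):
    """Sum the position-specific weights for a segment of matching length."""
    if len(segment) != len(matrix):
        raise ValueError("segment length must equal the matrix length")
    return sum(column[aa] for column, aa in zip(matrix, segment))


family = ["LAKE", "LAKD", "LGKE"]          # toy aligned family, purely hypothetical
weights = build_weight_matrix(family)
print(score_segment(weights, "LAKE"), score_segment(weights, "PPPP"))
```

A segment resembling the family receives a positive total and an unrelated one a negative total, which is the graded judgment that a strict consensus sequence cannot give.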
[74] In
addition to consensus sequences, weight matrices, and profiles, a further class
of strategies for determining structure-function relations comprises various
sequence alignment methods. In order to detect homologies between distantly related
proteins one method is to assign a measure of similarity to each pair of amino
acids, and then add up these pairwise scores for the entire alignment.
[75]
Related proteins will not have identical amino acids aligned, but they do have
chemically similar or replaceable amino acids in similar positions. In a
scoring method developed by Schwartz and Dayhoff, for example, amino acid pairs
that are identical or chemically similar were given positive scores, and pairs
of amino acids that are not related were assigned negative similarity scores.
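The pairwise scoring idea can be sketched in a few lines of Python. The grouping of amino acids and the score values below are invented for illustration and are not the values of the Schwartz and Dayhoff matrix; the point is only that an alignment score is the sum of per-pair similarity scores.

```python
# Hypothetical chemical groupings, chosen only to illustrate the method.
SIMILAR_GROUPS = [set("AVLIM"), set("FWY"), set("KRH"), set("DE"), set("STNQ")]


def pair_score(x, y):
    """Score one aligned pair: identical > chemically similar > unrelated."""
    if x == y:
        return 2
    if any(x in group and y in group for group in SIMILAR_GROUPS):
        return 1
    return -1


def alignment_score(aligned_a, aligned_b):
    """Add up the pairwise scores over an ungapped alignment of equal length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(x, y) for x, y in zip(aligned_a, aligned_b))


print(alignment_score("LKDE", "IRDE"))   # similar and identical pairs
print(alignment_score("LKDE", "GPGP"))   # unrelated pairs
```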
A dramatic illustration of how sequence alignment tools can be brought to bear on
determining function and structure is provided by the case of cystic fibrosis.
Cystic fibrosis is caused by aberrant regulation of chloride transport across
epithelial cells in the pulmonary tree, the intestine, the exocrine pancreas,
and apocrine sweat glands. This disorder was identified as due to defects in
the cystic fibrosis transmembrane conductance regulator protein (CFTR). The
CFTR gene was isolated in 1989, and subsequently identified as producing a
chloride channel whose activity depends on phosphorylation of particular
residues within the regulatory region of the protein. Using computer-based
sequence alignment tools of the sort described above, it was established that
consensus sequences for nucleotide-binding folds that bind ATP are present near
the regulatory region and that 70 percent of cystic fibrosis mutations are
accounted for by a 3 base-pair deletion that removes a phenylalanine residue
within the first nucleotide-binding domain. A significant portion of the
remainder of cystic fibrosis mutations affect a second nucleotide-binding
domain near the regulatory region.
[76]
In working out the folds and binding domains for the CFTR protein, Hyde, Emsley,
Hartshorn, et al. (1990) used sequence alignment methods similar to those
available in early models of the IntelliGenetics software suite.
[77]
In 1992 IntelliGenetics introduced BLAZE,
an even more rapid search program running on a massively parallel computer. As
an example of how computational genomics can be used to solve
structure-function problems in molecular biology, Brutlag repeated the CFTR
case using BLAZE. [78]
A sequence similarity search compared the CFTR protein to more than 26,000
proteins in a protein database of more than 9 million residues, resulting in a
list of 27 top similar proteins, all of which strongly suggested the CFTR
protein is a membrane protein involved in secretion. Another feature of the
comparison result was that significant homologies were shown with ATP-binding
transport proteins, further strengthening the identification of CFTR as a
membrane protein. The search algorithm identified two consensus sequence motifs
in the protein sequence of the cystic fibrosis gene product that corresponded
to the two sites on the protein involved in binding nucleotides.
The search also turned up distant homologies between the CFTR protein and proteins
of E. coli and yeast. The entire search took three hours. Such examples offer convincing
evidence that tools of computational molecular biology can lead to the
understanding of protein function.
The methods for analyzing sequence data discussed above were just the beginnings of
an explosion of database mining tools for genomics that is continuing to take
place.
[79]
In the process biology is becoming even more aptly characterized as an
information science.
[80]
Advances in the field have led to large-scale automation of sequencing in
genome centers employing robots. The success this large-scale sequencing of
genes has enjoyed has in turn spawned a similar approach to applying automation
to sequencing proteins, a new area complementary to genomics called proteomics.
Similar in concept to genomics, which seeks to identify all genes, proteomics
aims to develop techniques that can rapidly identify the type, amount and
activities of the thousands of proteins in a cell. Indeed, new biotechnology
companies have started marketing technologies and services for mining protein
information en masse. Oxford Glycosciences (OGS) in Abingdon, England, has
automated the laborious technique of two-dimensional gel electrophoresis.
[81]
In the OGS process, an electric current applied to a sample on a polymer gel
separates the proteins, first by their unique electric charge characteristics
and then by size. A dye attaches to each separated protein arrayed across the
gel. Then a digital imaging device automatically detects protein levels by how
much the dye fluoresces. Each of the 5,000 to 6,000 proteins that may be
assayed in a sample in the course of a few days is channeled through a mass
spectrometer that determines its amino acid sequence. The identity of the
protein can be determined by comparing the amino acid sequence with information
contained in numerous gene and protein databases. One imaged array of proteins
can be contrasted with another to find proteins specific to a disease.
In order to keep pace with this flood of data emerging from automated sequencing,
genome researchers have in turn looked increasingly to artificial intelligence,
machine learning, and even robotics in developing automated methods for
discovering patterns and protein motifs from sequence data. The power of these
methods is their ability both to represent structural features rather than
strictly evolutionary steps and to discover motifs from sequences
automatically. The methods developed in the field of machine learning have been
used to extract conserved residues, discover pairs of correlated residues, and
find higher order relationships between residues as well. Techniques from the
field of machine learning have included perceptrons, discriminant analysis,
neural networks, Bayesian networks, hidden Markov models, minimal length
encoding, and context-free grammars.
[82]
Important methods for evaluating and validating novel protein motifs have also
derived from the machine learning area.
An example of this effort to scale up and automate the discovery of structure and
function is EMOTIF (for “electronic-motif”), a program for discovering conserved
sequence motifs from families of aligned protein sequences developed by the Brutlag
Bioinformatics Group at Stanford.
[83]
Protein sequence motifs are usually generated manually with a single
“best” motif optimized at one level of specificity and sensitivity.
Brutlag’s aim was to automate this procedure. An automated method
requires knowledge about sequence conservation. For EMOTIF,
this knowledge is encoded as a particular allowed set of amino acid
substitution groups. Given an aligned set of protein sequences, EMOTIF works
by generating a set of motifs with a wide range of specificities and
sensitivities. EMOTIF can also generate motifs that describe possible
subfamilies of a protein superfamily. The EMOTIF program has been used to
generate a new database, called IDENTIFY, of 50,000 motifs from the combined
7000 protein alignments in two widely used public databases, the PRINTS and
BLOCKS databases.
By changing the set of substitution groups the algorithm can be adapted for
generating entirely new sets of motifs.
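The role of substitution groups can be illustrated with a deliberately simplified Python sketch. It is not the EMOTIF algorithm, which enumerates many motifs across a range of specificities; it derives a single motif by replacing each alignment column with the smallest allowed group that covers every residue observed there, or with a wildcard when no group does. The group list and function names are hypothetical.

```python
# Hypothetical substitution groups; changing this list changes the motifs produced.
SUBSTITUTION_GROUPS = ["ILMV", "FWY", "KR", "DE", "ST", "NQ", "AG"]


def column_pattern(residues):
    """Describe one alignment column as a residue, a bracketed group, or a wildcard."""
    observed = set(residues)
    if len(observed) == 1:
        return next(iter(observed))          # fully conserved position
    covering = [g for g in SUBSTITUTION_GROUPS if observed <= set(g)]
    if covering:
        return "[" + min(covering, key=len) + "]"   # most specific covering group
    return "."                               # no allowed group fits: any residue


def motif_from_alignment(aligned):
    """Build one regular-expression motif from equal-length aligned sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return "".join(column_pattern(column) for column in zip(*aligned))


print(motif_from_alignment(["LKDS", "IRDT", "VKEG"]))   # prints [ILMV][KR][DE].
```

Swapping in a different list of groups yields a different motif from the same alignment, which is the adaptability the text attributes to the method.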
Highly specific motifs are well suited for searching entire proteomes. IDENTIFY
assigns biological functions to proteins based on links between each motif and
the BLOCKS or PRINTS databases that describe the family of proteins from which
it was derived.
Because these protein families typically have several members, a match to a
motif may provide an association with several other members of the family. In
addition, when a match to a motif is obtained, that motif may be used to search
sequence databases, such as SWISS-PROT and GENPEPT,
for other proteins that share this motif. In their paper introducing these new
programs Nevill-Manning, Wu, and Brutlag showed that EMOTIF and IDENTIFY
successfully assigned functions automatically to 25-30% of the proteins in several bacterial
genomes and automatically assigned functions to 172 proteins of previously
unknown function in the yeast genome.
Many molecular biologists who welcomed the Human Genome Initiative with open arms
undoubtedly believed that when the genome was sequenced everyone would return
to the lab to conduct their experiments in a business-as-usual fashion,
empowered with a richer set of fundamental data. The developments in
automation, the resulting explosion of data, and the introduction of tools of
information science to master this data have changed the playing field forever:
there may be no “lab” to return to. In its place is a workstation
hooked to a massively parallel computer, producing simulations by drawing on
the data streams of the major databanks and carrying out “experiments” in silico
rather than in vitro.
The result of biology’s metamorphosis into an information science just
may be the relocation of the lab to the industrial park and the dustbin of
history.
ENDNOTES
[1]
See Richard Mark Friedhoff and William Benzon, The Second Computer Revolution:
Visualization, New York: W.H. Freeman, 1989. Other important discussions are in
Information Technology and the Conduct of Research: The User’s View,
Report of the Panel on Information Technology and the Conduct of Research,
National Academy of Sciences, 1989; B.H. McCormick, T.A. DeFanti, and M.D.
Brown, Visualization in Scientific Computing, NSF Report, published as a special
issue of Computer Graphics, Vol. 21 (6) (1987). An equally impressive survey is
the special issue on computational physics in Physics Today, October, 1987. See
especially the articles by Norman Zabusky, “Grappling with Complexity,”
Physics Today, October, 1987, pp. 25-27; Karl-Heinz A. Winkler, et al., “A
Numerical Laboratory,” ibid., pp. 28-37; Martin Karplus, “Molecular Dynamics
Simulations of Proteins,” ibid., pp. 68-72.
By
“computational science” I mean the use of computers in the
discipline sciences as distinct from computer science. The discipline sciences
include the problem domains of the physical and life sciences, economics,
medicine, much of applied mathematics, and so forth. See McCormick, et al.
“Visualization in Scientific Computing,” p. 11.
[2]
The sciences of visualization are defined by McCormick, et al.,
“Visualization in Scientific Computing,” p. A-1, as follows:
“Images and signals may be captured from cameras or sensors, transformed
by image processing, and presented pictorially on hard or soft copy output.
Abstractions of these visual representations can be transformed by computer
vision to create symbolic representations in the form of symbols and
structures. Using computer graphics, symbols or structures can be synthesized
into visual representations. It should be noted that computational science,
simulation, and visualization need not be interwoven.” Early workers in
computational physics, for example, devoted their efforts to the underlying
physics for simulating phenomena with little emphasis on visualization (Ulam
and Teller, for instance, as discussed by Peter Galison in chapter 9 of Image
and Logic). In recent years, however, numerical simulations in the physical
sciences have reached a degree of complexity such that they are
incomprehensible without visual representations. See the remarkable statements
of several leading scientists in the supercomputing field quoted in the
abstract from the session on “The Physical Simulation and Visual
Representation of Natural Phenomena,” from the 1987 meeting of the
Association for Computing Machinery, in Computer Graphics,
Vol. 21, no. 4 (1987), pp. 335-336.
[3]
Stephen S. Hall, “Protein Images Update Natural History,” Science
267(3 February) (1995): 620-624 points to the discovery of the structure of HGH
(human growth hormone) receptors as an example of a structure discovery that
could not have been accomplished without high-powered interactive graphics.
[4]
Richard Doyle, On Beyond Living: Rhetorical Transformations of the Life
Sciences, Stanford: Stanford University Press, 1997.
[5]
Diane M. Ramsey, ed., Image Processing in Biological Science
(Proceedings of a Conference held November, 1966), Berkeley and Los Angeles:
University of California Press, 1968, pp. xiii-xiv.
[6]
See Lusted’s foreword to Robert S. Ledley, Use of Computers in Biology and
Medicine, New York: McGraw-Hill, 1965, p. ix. This book was written as part of the
Committee’s recommendation. At the time it was written Ledley was
affiliated with the Division of Medical Sciences, National Academy of Sciences
National Research Council. Lusted went on to make the following recommendation:
3. The committee feels that there is a need for biomedical computer research
centers, which might be established at national research laboratories or in
association with academic institutions. I believe that a pilot project should
be made available as soon as possible to determine the utility of such centers.
In our judgment the purpose of these centers would be to cooperate in
large-scale biomedical research projects that utilize computers as a necessary
adjunct and to make computer facilities available for the use of biomedical
research workers from other institutions. Precedents for such large-scale
cooperative efforts have already been set in basic physics and other areas of
science. (ibid., pp. ix-x.)
[7]
Robert S. Ledley, Use of Computers in Biology and Medicine, p. xi.
[8]
Illustrated in Progress in Stereochemistry, 4 (1968).
[9]
See Cyrus Levinthal, “Molecular Model-Building by Computer,” Scientific
American, Vol. 214, no. 6, 1966, pp. 42-52.
[10]
Conventions for the description of protein molecules are provided in J.T.
Edsall, P.J. Flory, J.C. Kendrew, A.M. Liquori, G. Némethy, G.N. Ramachandran,
and H.A. Scheraga, Journal of Molecular Biology, Vol. 15 (1966), p. 399. Methods
for working with computer input are described by G. Némethy and H.A. Scheraga,
Biopolymers, Vol. 3 (1966), p. 155.
[11]
Anthony G. Oettinger, “The Uses of Computers in Science,” Scientific
American, Vol. 215, no. 3, 1966, pp. 161-172, quoted from p. 161.
[12]
Cyrus Levinthal, “Molecular Model-Building by Computer,” pp. 48-49.
[13]
Lou Katz and Cyrus Levinthal, "Interactive Computer Graphics and the
Representation of Complex Biological Structures," Annual Reviews in Biophysics
and Bioengineering, 1(1972): 465-504.
[14]
Further discussion of this aspect of the model-building program can be found in
C. Levinthal, C.D. Barry, S.A. Ward, and M. Zwick, “Computer Graphics in
Macromolecular Chemistry,” in D. Secrest and J. Nievergelt, eds., Emerging
Concepts in Computer Graphics, New York: W.A. Benjamin, 1968, pp. 231-253.
[15]
Stephen S. Hall, “Protein Images Update Natural History,” Science
267(3 February) (1995): 620-624 points to the discovery of the structure of HGH
(human growth hormone) receptors as an example of a structure discovery that
could not have been accomplished without high-powered interactive graphics.
[16]
Joshua Lederberg, “How DENDRAL Was Conceived and Born,” Stanford
Technical Reports 048087-54, Knowledge Systems Laboratory Report No. KSL 87-54.
Joshua Lederberg, Georgia L. Sutherland, Bruce G. Buchanan, Edward Feigenbaum,
“A Heuristic Program for Solving a Scientific Inference Problem: Summary
of Motivation and Implementation,” Stanford Technical Reports 026104,
Stanford Artificial Intelligence Project Memo AIM-104, November, 1969, p. 2.
[17]
Joshua Lederberg, “Topology of Molecules,” in The Mathematical Sciences: A
Collection of Essays, edited by the National Research Council Committee on
Support of Research in the Mathematical Sciences, Cambridge, Mass., MIT Press,
1969, pp. 37-51, quoted from pp. 37-38.
[18]
See Joshua Lederberg, Georgia L. Sutherland, Bruce G. Buchanan, and Edward A.
Feigenbaum, “A Heuristic Program for Solving a Scientific Inference
Problem: Summary of Motivation and Implementation,” Stanford Artificial
Intelligence Project Memo AIM-104, November, 1969.
[19]
“Second Cut at Interaction Language and Procedure,” in
“Chemistry Project,” Edward Feigenbaum Papers, Stanford Special
Collections SC-340, Box 13.
[20]
See Christian B. Anfinsen, “Principles that Govern the Folding of Protein
Chains,” Science 181, no. 4096 (1973): 223-230, which discusses the work for
which he was awarded the Nobel Prize for Chemistry in 1972: “This hypothesis
(the ‘thermodynamic hypothesis’) states that the three-dimensional structure
of a native protein in its normal physiological milieu...is the one in which
the Gibbs free energy of the whole system is lowest; that is, that the native
conformation is determined by the totality of interatomic interactions and
hence by the amino acid sequence, in a given environment.” (p. 223)
[21]
Bernstein, F. C., T. F. Koetzle, et al. (1977). “The Protein Data Bank: A
computer-based archival file for macromolecular structures.” Journal of
Molecular Biology 112: 535-542.
[22]
Holbrook, S. R., S. M. Muskal, et al. (1993). “Predicting Protein Structural
Features with Artificial Neural Networks.” Artificial Intelligence and
Molecular Biology. L. Hunter, ed. Menlo Park, CA, AAAI Press: 161-194.
[24]
D. Brutlag, “Understanding the Human Genome.” Scientific American
Introduction to Molecular Medicine. P. Leder, D. A. Clayton and E. Rubenstein,
eds. New York, NY, Scientific American, Inc., 1994: p. 159.
[25]
E. A. Feigenbaum and N. Martin, Proposal: MOLGEN - a computer science
application to molecular genetics, Heuristic Programming Project, Stanford
University, Technical Report No. HPP-78-18, 1977.
[26]
P. Friedland, Knowledge-Based Experiment Design in Molecular Genetics. Ph.D.
Thesis, Computer Science, Stanford University, Stanford, 1979.
[27]
E. A. Feigenbaum, B. Buchanan, et al., A Proposal for Continuation of the
MOLGEN Project: A Computer Science Application to Molecular Biology, Computer
Science Department, Stanford University, Heuristic Programming Project,
Technical Report No. HPP-80-5, April, 1980, Section 1, p. 1.
[28]
Douglas Brutlag, personal communication. Peter Friedland, personal
communication. After his work on MOLGEN and at IntelliGenetics (discussed
below) Friedland went on to become chief scientist at the NASA-Ames Laboratory
for Artificial Intelligence in 1987.
[29]
L.J. Korn, C.L. Queen, and M.N. Wegman. “Computer Analysis of Nucleic Acid
Regulatory Sequences.” Proceedings of the National Academy of Sciences 74
(1977): 4516-4520; R. Staden. “Sequence Data Handling by Computer.” Nucleic
Acids Research 4 (1977): 4037-4051; R. Staden. “Further Procedures for
Sequence Analysis by Computer.” Nucleic Acids Research 5 (1978): 1013-1015;
R. Staden. “A Strategy of DNA Sequencing Employing Computer Programs.” Nucleic
Acids Research 6 (1979): 2602-2610.
[30]
P. Friedland, D.L. Brutlag, J. Clayton, and L.H. Kedes. “SEQ: A
Nucleotide Sequence Analysis and Recombination System.” Nucleic Acids
Research 10 (1982): 279-294.
[31]
M. Stefik. “Inferring DNA Structures from Segmentation Data.” Artificial
Intelligence 11 (1977): 85-114.
[32]
T. Rindfleisch, P. Friedland, and J. Clayton. The GENET Guest Service on SUMEX,
SUMEX-AIM Report, 1981: Stanford University Special Collections, Friedland
Papers, Fldr GENET.
[33]
Doug Brutlag, Personal Communication, 6/19/99. Also discussed in the official
site review for BIONET conducted by the NIH Special Study Section, March 17-19,
1983, “BIONET, National Computer Resource for Molecular Biology,”
Stanford University Special Collections, Brutlag Papers, p. 2. Also discussed
in Roger Lewin, “National Networks for Molecular Biologists,” Science 223
(1984): 1379-1380.
[34]
This was announced to the GENET community by Allan Maxam, the chairman of the
national advisory board. See: Allan M. Maxam to GENET Community. Subject:
Closing of GENET: August 23, 1982. Stanford University Special Collections,
Peter Friedland Papers, Fldr GENET.
[35]
Business Plan for IntelliGenetics, May 8, 1981, p. 5. Stanford Special
Collections, Brutlag Papers, Fld IntelliGenetics. Emphasis in the original.
Details of the software licensing arrangement and the revenues generated are
discussed in a letter to Niels Reimers, Stanford Office of Technology Licensing
on the occasion of renegotiating the terms. See: Peter Friedland to Niels
Reimers. Subject: Software Licensing Agreement: April 2, 1984. Stanford
University Special Collections, Fldr IntelliGenetics.
[36]
Roger Lewin, “National Networks for Molecular Biologists,” Science 223
(1984): 1379-1380.
[37]
Lewin noted that this was the largest award of its kind by NIH to a for-profit
organization. See ibid., p. 1380.
[38]
Minutes of the Meeting of the National Advisory Committee for BIONET, March 23,
1985 (Final version prepared 1 August 1985), p. 4. In Stanford University
Special Collections, Brutlag Papers, Fld. BIONET.
[39]
“BIONET users status,” from BIONET managers’ meeting, April
3, 1986; and ibid., October 9, 1986. In Stanford University Special
Collections, Brutlag Papers, Fld. BIONET.
[40]
Joel Huberman. “BIONET: Computer Power for the Rest of Us.”
(1989): p. 1.
[41]
Peter Friedland, "BIONET Organizational Plans," 27 April 1984, Company
Confidential Memo. Stanford University Special Collections, Brutlag Papers,
Fldr BIONET, p. 1. A published version of these objectives appeared as: Dennis
H. Smith, Douglas Brutlag, Peter Friedland, and Laurence H. Kedes, “BIONET™:
national computer resource for molecular biology,” Nucleic Acids Research 14,
no. 1 (1986): 17-20.
[42]
IntelliGenetics, Introduction to Bionet™: A Computer Resource for Molecular
Biology, User manual for Bionet subscribers, Release 2.3, Mountain View, CA,
IntelliGenetics, 1987, p. 23, “Databases available on BIONET.”
[43]
Douglas Brutlag, Personal communication, June 19, 1999.
[44]
Steve Boswell, “Los Alamos Workshop—Exploring the Role of Robotics
and Automation in Decoding the Human Genome,” IntelliGenetics trip
report, January 9, 1987. In Stanford Special Collections, Brutlag Papers, Fld.
BIONET.
[45]
Barbara H. Duke, Contracting Officer, NIH, to IntelliGenetics, Inc. "Request
for Revised Proposal in Response to Request for Proposals RFP No. NIH-GM-87-04
entitled ‘Nucleic Acid Sequence Data Bank,’" June 3, 1987, Letter
with attachment. Stanford Special Collections, Brutlag Papers, Fld. BIONET.
[46]
See PC/Gene: Microcomputer Software for Protein Chemists and Molecular
Biologists, User Manual, Mountain View, CA, IntelliGenetics, 1986, pp. 99-120.
[47]
Douglas L. Brutlag and David Kristofferson. “BIONET: An NIH Computer
Resource for Molecular Biology.” Biomolecular Data: A Resource in Transition.
Ed. R. R. Colwell. Oxford: Oxford University Press, 1988. 287-294. Also see
“Automatic Data Submission to GenBank, EMBL, and NBRF-PIR,” BIONET News,
Vol. 1, No. 1, April 1988.
[49]
Douglas Brutlag, Personal Communication, June 19, 1999. See nomination for
Smithsonian-Computerworld Award in Stanford Special Collections, Brutlag
Papers, Fld. Smithsonian Computerworld Award.
[50]
Quoted from Steve Boswell, “Los Alamos Workshop—Exploring the Role
of Robotics and Automation in Decoding the Human Genome,” IntelliGenetics
trip report, January 9, 1987, p. 2. In Stanford Special Collections, Brutlag
Papers, Fld. BIONET.
[51]
H. Morowitz, Models for Biomedical Research: A New Perspective. Washington,
D.C., National Academy of Sciences Press, 1985, p. 21.
[54]
Board of Regents, NLM Long Range Plan (Report of the Board of Regents),
Bethesda, MD, National Library of Medicine, 1987.
[58]
Ibid., pp. 46-47. The figures for Medical Informatics were $7.4, $9.9, and $13
million for 1988-90.
[59]
See Christian B. Anfinsen, “Principles that Govern the Folding of Protein
Chains,” Science 181, no. 4096 (1973): 223-230, which discusses the work for
which he was awarded the Nobel Prize for Chemistry in 1972: “This hypothesis
(the ‘thermodynamic hypothesis’) states that the three-dimensional structure
of a native protein in its normal physiological milieu...is the one in which
the Gibbs free energy of the whole system is lowest; that is, that the native
conformation is determined by the totality of interatomic interactions and
hence by the amino acid sequence, in a given environment.” (p. 223)
[60]
Walter Gilbert, “Towards a Paradigm Shift in Biology,” Nature 349 (1991),
p. 99.
[62]
Douglas L. Brutlag, “Understanding the Human Genome,” in P. Leder,
D.A. Clayton, and E. Rubenstein, eds., Scientific American: Introduction to
Molecular Medicine, New York: Scientific American, Inc., 1994, pp. 153-168.
[63]
Walter Gilbert, “Towards a Paradigm Shift in Biology,” Nature 349 (1991),
p. 99.
[64]
These are the disciplines graduate students and postdocs in molecular biology
in Brutlag’s lab at Stanford are expected to work with. Source: Douglas
Brutlag, “Department Review: Bioinformatics Group, Department of
Biochemistry, Stanford University, 1998,” personal communication.
[65]
M.G. Rossmann, D. Moras, K.W. Olsen, “Chemical and Biological Evolution
of a Nucleotide-binding Protein,” Nature 250 (1974): pp. 194-199; T.E.
Creighton, Proteins: Structure and Molecular Properties, New York: W.H.
Freeman and Co., 1983; J.J. Birktoft, L.J. Banaszak, “Structure-function
Relationships Among Nicotinamide-adenine Dinucleotide Dependent
Oxidoreductases,” in M.T.W. Hearn, ed., Peptide and Protein Reviews, New York:
Marcel Dekker, Vol. 4, 1984, pp. 1-47.
[66]
S. B. Needleman and C. D. Wunsch, “A general method applicable to the
search for similarities in the amino acid sequence of two proteins.” Journal
of Molecular Biology 48 (1970): 443.
[67]
Since insertions and deletions (gaps) within a motif are not easily handled
from a mathematical point of view, a more technical term, “alignment
block,” has been introduced that refers to conserved parts of multiple
alignments containing no insertions or deletions. Peer Bork and Toby J.
Gibson, “Applying Motif and Profile Searches,” Computer Methods for
Macromolecular Sequence Analysis. Ed. R.F. Doolittle. Vol. 266. Methods in
Enzymology. San Diego: Academic Press, 1996. 162-184, especially p. 163.
[68]
J. S. Richardson and D. C. Richardson, “Principles and Patterns of Protein
Conformation.” Prediction of Protein Structure and the Principles of Protein
Conformation. G. D. Fasman, ed. New York, Plenum Press, 1989.
[69]
R. L. Dorit, L. Schoenbach, and W. Gilbert. “How big is the universe of
exons?” Science 250 (1990): p. 1377.
[70]
A. Bairoch. “PROSITE: A Dictionary of Sites and Patterns in Proteins.”
Nucleic Acids Research 19 (1991): 2241.
[71]
Bork, Ouzounis, and Sander state that the likelihood of identifying homologues
is currently higher than 80% for bacteria, 70% for yeast, and about 60% for
animal sequence series. See P. Bork, C. Ouzounis, and C. Sander, Current
Opinion in Structural Biology 4 (1994): 393; Peer Bork and Toby J. Gibson,
“Applying Motif and Profile Searches,” cited in note 67 above.
[72]
Laszlo Patthy. “Consensus Approaches in Detection of Distant Homologies.”
Computer Methods for Macromolecular Sequence Analysis. Ed. R.F. Doolittle.
Vol. 266. Methods in Enzymology. San Diego: Academic Press, 1996. 184-198.
[73]
R. G. Brennan and B. W. Matthews, “The Helix-Turn-Helix DNA Binding
Motif.” Journal of Biological Chemistry 264 (1989): 1903.
[74]
M. Gribskov, A. D. McLachlan, et al. “Profile analysis: Detection of
distantly Related Proteins,” Proceedings of the National Academy of Sciences
84 (1987): 4355; M. Gribskov, M. Homyak, et al. “Profile scanning for
three-dimensional structural patterns in protein sequences,” Computer
Applications in the Biosciences 4 (1988): 61.
[75]
R. M. Schwartz and M.O. Dayhoff, “Matrices for Detecting Distant
Relationships,” Atlas of Protein Sequence and Structure 5 (1979)
(Supplement 3): p. 353.
[76]
S. C. Hyde, P. Emsley, et al. (1990). “Structural Model of ATP-binding
Proteins Associated with Cystic Fibrosis, Multidrug Resistance and Bacterial
Transport.” Nature 346: 362-365; B. S. Kerem, J. M. Rommens, et al.
“Identification of the Cystic Fibrosis Gene: Genetic Analysis,” Science 245
(1989): 1073-1080; B. S. Kerem, J. Zielenski, et al. “Identification of
Mutations in Regions Corresponding to the Two Putative Nucleotide
(ATP)-Binding Folds of the Cystic Fibrosis Gene,” Proceedings of the National
Academy of Sciences 87 (1990): 8447-8451; J. R. Riordan, J. M. Rommens, et
al., “Identification of the Cystic Fibrosis Gene: Cloning and Characterization
of Complementary DNA,” Science 245 (1989): 1066-1073.
[77]
Hyde, Emsley, et al. used the Chou-Fasman algorithm (1973) for identifying
consensus sequences and the Quanta™ modeling package produced by Polygen
Corp., Waltham, Mass. for modeling the protein and its binding sites. See S.
C. Hyde, P. Emsley, et al. (1990). “Structural Model of ATP-binding
Proteins Associated with Cystic Fibrosis, Multidrug Resistance and Bacterial
Transport.” Nature 346: 362-365.
[78]
D. Brutlag, “Understanding the Human Genome.” Scientific American
Introduction to Molecular Medicine. P. Leder, D. A. Clayton and E. Rubenstein,
eds. New York, NY, Scientific American, Inc., 1994: pp. 164-166.
[79]
See for instance the National Institute of General Medical Sciences (NIGMS),
“Protein Structure Initiative Meeting Summary,” April 24,
1998, at: http://www.nih.gov/nigms/news/reports/protein_structure.html
[80]
I have focused on the development of software in this discussion. But a further
crucial stimulus to the takeoff of bioinformatics, of course, was hardware
and networking development. The growth of databases and complexity of the
searches that were to be undertaken stimulated the demand for faster
algorithms, more powerful computer systems, and network bandwidth. At the
beginning of this “bioinformatics revolution” in the 1970s, for
example, a search on a DNA sequence of typical size would be performed by a
computer capable of performing one million instructions per second (one MIPS)
and would take approximately 15 minutes. Throughout the late 1970s and 1980s
mini-computers and personal computer workstations continued to increase in
power at about the same rate as the growth of the databases, so that a typical
search still took around 15 minutes. By the end of the 1980s, however, the
growth in sequence data—now hundreds of megabytes in size—had
overtaken the ability of computers to search it with acceptable turnaround
time. Shortcut search methods and more efficient code helped, but the most
rigorous and sensitive searches began to require hours of computing time to
align and score even a single query sequence against a database of sequences.
The NIH and NSF responded to the challenge by supporting research and
development of new computer architectures, regional supercomputer centers and
several large-scale computing initiatives. (See Thomas P. Hughes, et al., eds.,
Funding a Revolution: Government Support for Computing Research, Washington,
D.C., National Academy Press, 1999.) Commercial vendors such as
DEC, SUN Microsystems, Cray Computers, and MasPar Computer Corporation tried to
meet the large-scale computing needs of geneticists with, for example,
massively parallel computers, such as the MasPar MP-1 computer. In early 1992,
the MasPar MP-1104 with 4,096 processors could search the entire Swiss-Protein
database in 30 seconds with a query of 100 amino acids, and a query of 1000
amino acids could be executed on the GenBank database (74,000 sequences) in 15
minutes. (See IntelliGenetics, Inc., and MasPar Computer Corporation,
“BLAZE: A Massively Parallel Sequence Similarity Search Program for
Molecular Biologists,” Product Information Bulletin, May 1992.)
[82]
See especially the papers in L. Hunter, ed., Artificial Intelligence and
Molecular Biology, Menlo Park, CA, AAAI Press, 1993.
[83]
C. G. Nevill-Manning, T. D. Wu, et al. “Highly Specific Protein Sequence
Motifs for Genomic Analysis.” Proceedings of the National Academy of Sciences
95 (1998): in press. EMOTIF can be viewed at
http://motif.stanford.edu/emotif