Science and the Academy
of the 21st Century: Does Their Past Have a Future in an Age of
Computer-Mediated Networks?
Since their inception as formal
institutions in the seventeenth century, academies of science have been imagined
above all as sites of optimal communication. Historians of the early
state-recognized academies of science in Paris and London have pointed to three
key features in the transition from amateur groups of aficionados and virtuosi
to learned societies: Each organization
had a communal mission to share information within its membership; each also
had a recording secretary and a corresponding secretary who disseminated the
group's collective findings in some form of journal or newsletter; each group
also placed a high value on experimentation, and prided itself on its
laboratory as much as on its secretary and journal. While agreeing that due
emphasis should be placed on the role of the journal in helping to establish
international communities of natural philosophers interacting critically with
one another's work, scholars such as Elizabeth Eisenstein and Bruno Latour go
even further in emphasizing the material features of scientific communications,
particularly the late 15th-century communications revolution in
printing and improvements in postal delivery, as crucial to the entire
scientific revolution itself.[1]
In her work on the printing press as an agent of change, for instance,
Eisenstein observes that prior to the
innovative intellectual contributions of the Copernicuses, the Galileos, the
Keplers, and the Newtons of the Scientific Revolution, there was the quiet but
vitally important circulation of astronomical tables, such as the Alphonsine
Tables, diagrams accompanying mathematical demonstrations and astronomical
constructions, maps of all kinds, and images of plants. According to
Eisenstein, in contrast to the scribal tradition of copying manuscripts, the
printed text became the anchor for feedback, sustained discussion, and
incremental cumulative improvement of the information communicated. When early
academy founders referred to the need for creating an international community
and for extracting their enterprise from verbal dispute, it was the new visual
language in the pictorial statements of Vesalius or the triangles of Galileo or
the illustrations of the moon's surface that made it possible. They boldly
proclaimed that natural philosophers should turn away from bookish learning and
learn through observation and experiment to read the book of nature directly.
But crucial to this perspective is that before natural philosophers could begin
to read the book of nature, nature had to be inscribed in book form.
The thesis set forth by
Eisenstein and transformed by Latour into a theory of "immutable
mobiles" for creating actor-networks has been critiqued for its implicit
technological determinism by Adrian Johns in his brilliant study, The Nature of the Book: Print and Knowledge
in the Making. Johns argues that before the introduction of lithography in
the 19th century, "fixity" was anything but a firm property
of printed texts--and he in fact disputes that "fixity" ever came into
being. In any case, the focus on technology displaces the important issue in
Johns' view. Rather than treating fixity as an inherent property of printed
texts which they carry with them from place to place, we should explore the
processes for calibrating local reading practices required to make a text
produced in one site authoritative in another. Such an investigation calls for
the recognition that fixity exists only when it is recognized and acted upon by
people. Accordingly, print culture is a result of various contested
representations, practices and conflicts rather than a deterministic consequence
of print technology.[2]
I couldn’t agree more. But to be
fair to Eisenstein (and Latour), the position they were arguing against paid lip
service at best to the role of technology. A statement by Archer Taylor, one among
numerous others that could be cited, illustrates that position:
The powers which shape men's lives may be
expressed in books and type, but by and of itself printing…is only a tool, an
instrument, and the multiplication of tools and instruments does not of itself
affect intellectual and spiritual life.[3]
The point is that print technology was not
just any tool. While not succumbing to an "impact" model, we should
examine the special features that distinguish printing from other innovations.
Through its development, modification, and adaptation to the interests of
particular groups printing technology became the material artifact around which
new forms of communication and new intellectual configurations came about, one
of them being natural science as we know it. As Eisenstein observed, one cannot
treat printing as just one among many elements in a complex causal nexus, for
the communications shift transformed the nature of the causal nexus itself.
What is at stake here is the construction of the printed text as medium; and that story cannot be one
simply of technology because the text cannot compel readers to react in certain
ways. Crucial to such a program is a shift away from emphasis on the technology
alone to focus on the construction of readers, and more broadly on the moral
values of trust and convention in the making of knowledge, and on how readers
decide what to believe when confronted with printed materials. To put it in its
most general terms: media are
institutions. The task of the media historian is twofold: On the one hand
to understand how the medium is constructed in a co-evolutionary entanglement
of technical artifacts, such as the various elements for producing printed
works, the organizations for distributing them, and social roles such as that
of author, publisher, and reader; and on the other hand the configurations of
knowledge and practice for those who work within the medium.
The controversies around print
culture and the history of the book provide a point of departure for
considering the topic that interests me today: namely the transformation in
scholarly practices and indeed in the reconfiguration of knowledge and
knowledge-production being brought about by the communications revolution of
our own day, computer-mediated communication. I want to concentrate on fields
of biomedical research where computer-mediated communication is having some of
its most dramatic effects, particularly through the emergence of genomics and
post-genomic science. If we think of the work of Watson, Crick, Monod and Jacob,
Gamow, Nirenberg, and others as having launched biology into an era of
exploring the structures for encoding and replicating biological information,
then the human genome project and its ancillary tools for reading, writing,
indexing and searching (analogous in many ways to the "knowledge
industry" described by Eisenstein) massive libraries of biological data
are creating its second (science-based) industrial revolution. More
importantly, as these new tools from information science have been adopted within
biology, the conceptual terrain and indeed many of the material practices of
biologists have been and are continuing to be radically transformed. It hasn't
happened by simply booting up a computer. Controversy, labor, and the sustained
contributions of numerous groups of scientists and engineers over two decades
have gone into shaping these social and technical networks into a new medium.
The development of bioinformatics, which can be generally defined as the study of how information technologies are used
to solve problems in biology, offers a useful window for illustrating
those changes. Today I want to suggest that the outcome of these changes will
be the fusion of the communication and experimentation functions--the merging
of the journal and the lab--in the post-modern academy.
The Holy Grail of many
biologists since the 1960s has been the construction of a mathematized,
predictive theoretical biology. An early vision of how to get there was
directed by the notion that the information for the three dimensional folding
and structure of proteins is uniquely contained in the linear sequence of their
amino acids and that understanding the dynamics of protein folding would
provide the basis for computational biology.[4]
The molecular dynamics approach assumed that if all the forces between atoms in
a molecule, including bond energies and electrostatic attraction and repulsion,
are known, then it is possible to calculate the three-dimensional arrangement
of atoms that requires the least energy. While theoretically elegant, the
determination of protein structure from chemical and dynamical principles has
been hobbled with difficulties. In the abstract, analysis of physical data
generated from protein crystals, such as x-ray and nuclear magnetic resonance
data, should offer rigorous ways to connect primary amino acid sequences to 3D
structure. But the problems of acquiring good crystals and the difficulty of
getting NMR data of sufficient resolution are impediments to this approach.
These and several related difficulties have contributed to the slow rate of
progress in registering atomic coordinates of macromolecules.[5]
Moreover, while quantum mechanics provides a solution to the protein folding
problem in theory, the computational task of predicting structure from first
principles for large protein molecules containing many thousands of atoms has
proven impractical. Shortcuts have been developed that combine molecular
dynamics computations, artificial intelligence and interactive computer
graphics in deriving protein structure directly from chemical structure. But
the task is still daunting.
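The least-energy calculation described above can be made concrete with a deliberately small sketch. It is not the method of any particular molecular dynamics program: the three-atom chain, the harmonic bond term, the Coulomb term, and every constant (k_bond, r0, k_coulomb, the step size) are assumptions chosen only for illustration, and plain gradient descent stands in for the far more elaborate procedures used in practice.

```python
import math

def energy(coords, charges, k_bond=100.0, r0=1.0, k_coulomb=1.0):
    """Toy potential: harmonic springs between chain neighbours,
    Coulomb interaction between every other pair of atoms."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            r = math.dist(coords[i], coords[j])
            if j == i + 1:
                total += 0.5 * k_bond * (r - r0) ** 2             # bonded pair
            else:
                total += k_coulomb * charges[i] * charges[j] / r  # non-bonded pair
    return total

def gradient(coords, charges, h=1e-6):
    """Central-difference estimate of dE/dx for every coordinate."""
    grad = []
    for i in range(len(coords)):
        row = []
        for axis in range(3):
            plus = [list(c) for c in coords]
            minus = [list(c) for c in coords]
            plus[i][axis] += h
            minus[i][axis] -= h
            row.append((energy(plus, charges) - energy(minus, charges)) / (2 * h))
        grad.append(row)
    return grad

def minimize(coords, charges, step=1e-3, iterations=2000):
    """Plain gradient descent: move each atom a small step downhill."""
    coords = [list(c) for c in coords]
    for _ in range(iterations):
        grad = gradient(coords, charges)
        for i, row in enumerate(grad):
            for axis in range(3):
                coords[i][axis] -= step * row[axis]
    return coords

# A bent three-atom chain whose atoms all carry the same (hypothetical) charge.
start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
charges = [1.0, 1.0, 1.0]
relaxed = minimize(start, charges)
print(energy(start, charges), energy(relaxed, charges))
```

Even in this toy case every step requires re-evaluating all pairwise terms, which hints at why the same calculation becomes impractical for proteins with many thousands of atoms.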
While structure determination
was moving at a snail’s pace, beginning in the 1970s another stream of work
contributed to the transformation of biology into an information science. In the
mid-1970s new tools of molecular biology, such as recombinant DNA techniques,
gene cloning, restriction enzymes, protein sequencing, and gene product
amplification began to emerge. Biologists were suddenly awash in a sea of new
data. They deposited this data in large and growing electronic databases of
genetic maps, atomic coordinates for chemical and protein structures, and
protein sequences. Indeed more than 140,000 genes were cloned and sequenced in
the twenty years from 1974 to 1994, of which more than 20 percent were human
genes.[6]
By the early 1990s, at the beginning of the Human Genome Initiative, the NIH
GenBank database (release 70) contained more than 74,000 sequences, while the
Swiss Protein database (Swiss-Prot) included nearly 23,000 sequences. Protein
databases were doubling in size every 15 months, and some were predicting that
as a result of the technological impact of the Human Genome Initiative, ten million
base pairs a day would be sequenced by the year 2000.
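To give a feel for what a 15-month doubling time means, here is a small worked calculation. It assumes simple exponential growth, N(t) = N0 * 2^(t / T), with T = 15 months. The function names are hypothetical, and the projected figures are extrapolations from the counts quoted above, not historical database sizes.

```python
import math

def projected_size(initial, months, doubling_months=15.0):
    """Size after `months` if the collection doubles every `doubling_months`."""
    return initial * 2 ** (months / doubling_months)

def months_to_reach(initial, target, doubling_months=15.0):
    """Months needed to grow from `initial` to `target` at the same rate."""
    return doubling_months * math.log2(target / initial)

# Starting from the roughly 23,000 Swiss-Prot sequences mentioned in the text.
print(projected_size(23_000, 30))          # two doublings
print(months_to_reach(23_000, 1_000_000))  # months until one million entries
```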
Such an explosion of data and
its registration in databases encouraged the development of a second approach
to determining function and structure of protein sequences: namely, prediction
from sequence data alone by applying artificial intelligence and expert systems
and by developing search tools to identify structures and patterns in the data.
That is the vision of bioinformatics, a
discipline less than a decade old. It studies two important information flows
in modern biology. The first is the flow of genetic
information from the DNA of an individual organism up to the
characteristics of a population of such organisms (with an eventual passage of
information back to the genetic pool, as encoded within DNA). The second is the
flow of experimental information from observed biological phenomena to models
that explain them, and then to new experiments in order to test these models.
A key project illustrating the
ways in which computer science and molecular biology began to merge in the
formation of bioinformatics was the MOLGEN
project at Stanford and events related to the formation and subsequent
development of BIONET. MOLGEN was
a continuation of projects in artificial intelligence and knowledge engineering
begun at Stanford with DENDRAL during
the 1960s. MOLGEN was started in
1975 as an artificial intelligence project in the Heuristic Programming Project
with Edward Feigenbaum as principal investigator directing the thesis projects
of Mark Stefik and Peter Friedland.[7]
The aim of MOLGEN was to model
the experimental design activity of scientists in molecular genetics.[8]
Before an experimentalist sets out to achieve some goal, he produces a working
outline of the experiment, guiding each step of the process. The central idea
of MOLGEN was that in designing a
new experiment scientists rarely plan from scratch. Instead they find a
skeletal plan, an overall design that has worked for a related or more abstract
problem, and then adapt it to the particular experimental context. Like DENDRAL, this approach is heavily
dependent upon large amounts of domain-specific knowledge in the field of
molecular biology and especially upon good heuristics for choosing among
alternative implementations.
MOLGEN’s designers chose molecular biology as
appropriate for the application of artificial intelligence because the
techniques and instrumentation generated in the 1970s seemed ripe for
automation. The advent of rapid DNA cloning and sequencing methods had had an
explosive effect on the amount of data that could be most readily represented
and analyzed by a computer. Moreover, it appeared that very soon progress in
analyzing information in DNA sequences would be limited by the lack of an appropriate
combination of the available search and statistical tools. MOLGEN was intended to apply rules to
detect profitable directions of analysis and to reject unpromising ones.[9]
Peter Friedland was responsible
for constructing the knowledge-base component of MOLGEN, and though not himself a molecular biologist, he made
a major contribution to the field by assembling the rules and techniques of
molecular biology into an interactive, computerized system of analytical
programs. Friedland worked with Stanford molecular biologists Douglas Brutlag,
Laurence Kedes, John Sninsky, and Rosalind Grymes, who provided expert
knowledge on enzymatic methods, nucleic acid structures, detection methods, and
pointers to key references in all areas of molecular biology. Brutlag, Kedes,
Sninsky, and Grymes were interested in having a battery of automated tools for
sequence analysis, and they contracted with Friedland and Stefik—both gifted
computer program designers—to build them in exchange for contributing their
expert knowledge to the project.[10]
This collaboration of computer scientists and molecular biologists helped
biology along the road to becoming an information science.
An example of the programs
Friedland and Stefik created for MOLGEN was
SEQ, an interactive
self-documenting program for nucleic acid sequence analysis which had 13 different
procedures with over 25 different sub-procedures, many of which could be
invoked simultaneously to provide different analytical methods for any sequence
of interest. SEQ brought together
in a single program methods for primary sequence analysis described in the
literature.[11] SEQ also performed homology searches
on DNA sequences, reporting the degree of homology, as well as searches for dyad
symmetry (inverted repeats).[12]
Another feature of SEQ was its
ability to prepare restriction maps with the names and locations of the
restriction sites marked on the nucleotide sequence in addition to having a
facility for calculating the length of DNA fragments from restriction digests
of any known sequence.
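The kinds of analysis attributed to the program above (locating restriction sites, computing fragment lengths from a digest, and finding inverted repeats) can be sketched in a few lines of modern code. This is an illustration only and makes no claim about how the original program was written. The three enzymes and their recognition sequences are real, but the function names, the example sequence, and the restriction to a linear molecule and a fixed window length are assumptions of the sketch.

```python
# Recognition sequence and cut offset (bases from the start of the site)
# for three common restriction enzymes, e.g. EcoRI cuts G^AATTC.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
    "HindIII": ("AAGCTT", 1),
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]

def restriction_map(seq, enzymes=ENZYMES):
    """Sorted (position, enzyme) pairs; position is the 0-based site start."""
    hits = []
    for name, (site, _offset) in enzymes.items():
        start = seq.find(site)
        while start != -1:
            hits.append((start, name))
            start = seq.find(site, start + 1)
    return sorted(hits)

def fragment_lengths(seq, sites, enzymes=ENZYMES):
    """Fragment lengths when a linear sequence is cut at every mapped site."""
    cuts = sorted(start + enzymes[name][1] for start, name in sites)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

def dyad_symmetries(seq, length=6):
    """Start positions of windows that equal their own reverse complement."""
    return [i for i in range(len(seq) - length + 1)
            if seq[i:i + length] == reverse_complement(seq[i:i + length])]

example = "TTGAATTCAAGGATCCTT"   # made-up sequence for illustration
sites = restriction_map(example)
print(sites)
print(fragment_lengths(example, sites))
print(dyad_symmetries(example))
```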
In its first phase of
development (1977-1980) MOLGEN consisted
of programs such as those described above and a knowledge base containing information on
about 300 laboratory methods and 30 strategies for using them. It also
contained the best currently available data on about 40 common phages,
plasmids, genes, and other known nucleic acid structures. The second phase of
development beginning in 1980 scaled up both analytical tools and knowledge
base. Perhaps the most significant aspect of the second phase was making MOLGEN available to the scientific
community at large on the Stanford University Medical Experimental national
computer resource, SUMEX-AIM. SUMEX-AIM, supported by the Biotechnology
Resources Program at NIH since 1974, had been home to DENDRAL and several other programs. The new experimental
resource on SUMEX, comprising the MOLGEN
programs and access to all major genetic databases, was called GENET. In February 1980 GENET was made available to a carefully
limited community of users.[13]
MOLGEN and GENET
were immediate successes with the molecular biology community. In their
first few months of operation in 1980 more than 200 labs (with several users in
each of those labs) accessed the system. By November 1, 1982 more than 300 labs
from 100 institutions accessed the system around the clock.[14]
Traffic on the site was so heavy that restrictions had to be implemented and
plans for expansion considered. In addition to the academic users a number of
biotech firms, such as Monsanto, Genentech, Cetus, and Chiron, used the system
heavily. In order to ensure that the academic community had unrestricted access
to the SUMEX computer and that the NIH would be satisfied that commercial users were
not getting unfair access to the resource, Feigenbaum, principal investigator
in charge of the SUMEX resource, and Thomas Rindfleisch, facility manager, decided
to exclude commercial users.[15]
To provide commercial users with
their own unrestricted access to GENET and
MOLGEN programs, Brutlag,
Feigenbaum, Friedland, and Kedes formed a company, IntelliGenetics, which would
offer the suite of MOLGEN software
for sale or rental to the emerging biotechnology industry. With 125 research
labs doing recombinant DNA research in the US alone and a number of new genetic
engineering firms starting up, opportunities looked outstanding. Numerous firms
were being formed with staffs of molecular biologists exceeding 50 individuals,
and several were planning to hire over 1,000 Ph.D.s in molecular biology. No
one was currently supplying software in this rapidly growing genetic
engineering marketplace. With their exclusive licensing arrangement with
Stanford for the MOLGEN software,
IntelliGenetics was poised to lead a huge growth area.[16] The resource that IntelliGenetics
eventually offered to commercial users was BIONET.
Like GENET, its prototype, BIONET combined in one computer site
databases of DNA sequences with programs to aid in their analysis.
Prior to the startup of BIONET, GENET was not the only resource for DNA sequences. Several
researchers were making their databases available. Margaret Dayhoff had created
a database of DNA sequences and some software for sequence analysis for the
National Biomedical Research Foundation that was marketed commercially. Walter
Goad, a physicist at Los Alamos National Laboratory, collected DNA sequences
from the published literature and made them freely available to researchers.
But by the late 1970s the number of bases sequenced was already approaching 3
million and expected to double soon. Some form of easy communication between
labs and effective data handling was considered a major priority in the
biological community. While experiments were going on with GENET, a number of nationally prominent
molecular biologists had been pressing to start an NIH-sponsored central
repository for DNA sequences. An early meeting organized by Joshua Lederberg
was held in 1979 at Rockefeller University. The proposed NIH initiative was
originally supposed to be coordinated with a similar effort at the European
Molecular Biology Laboratory (EMBL) in Heidelberg, but the Europeans became
dissatisfied with the lack of progress on the American end and decided to go
ahead with their own databank. EMBL announced the availability of its
Nucleotide Sequence Data Library in April 1982, several months before the
American project was funded. Finally, in August, 1982 the NIH awarded a
contract for $3 million over 5 years to the Boston-based firm of Bolt, Beranek,
and Newman (BB&N) to set up the national database known as GenBank in
collaboration with Los Alamos National Laboratory.
Although GenBank launched a
formal national DNA sequence collection effort, the need for computational
facilities voiced by molecular biologists was still left unanswered. In
September 1983, after a review process that took over a year, the NIH Division
of Research Resources awarded IntelliGenetics a $5.6 million five-year contract
to establish BIONET, in part to
address the need for a national center for computational analysis of DNA.[17]
The contract started on March 1, 1984 and ended on February 27, 1989.
BIONET first became available to the research
community in November 1984. The fee for use was $400 per year per laboratory,
and remained at that level throughout its first five years. BIONET’s use grew impressively.
Initially the IntelliGenetics team set the target for user subscriptions at 250
labs. However, the annual report for the first year’s activities of BIONET in March 1985 listed 350 labs
with nearly 1132 users. By August 1985 that number had increased dramatically
to 450 labs and 1500 users.[18]
By 1989, 900 laboratories in the U.S., Canada, Europe, and Japan (comprising
about 2800 researchers) subscribed to BIONET,
and 20 to 40 new laboratories joined each month.[19]
BIONET was intended to establish a national
computer resource for molecular biology satisfying three goals. A first goal
was to provide a way for academic biologists to obtain access to computational
tools to facilitate their nucleic acid (and possibly protein) related research.
In addition to giving researchers ready access to national databases on DNA and
protein sequences, BIONET would
provide a library of sophisticated software for sequence searching, matching,
and manipulation. A second goal was to provide a mechanism to facilitate
research into improving such tools. The BIONET
contract provided research and development support for additional software. A
third goal of BIONET was to
enhance scientific productivity through electronic communications.
The
stimulation of collaborative work through electronic communication was perhaps
the most impressive achievement of BIONET.
BIONET was much more than the
Stanford GENET plus the MOLGEN-IntelliGenetics suite of
software. Whereas GENET with its
pair of ports could accommodate only two users at any one time, BIONET had 22 ports providing an
estimated annual 30,000 connect hours.[20]
All subscribers to BIONET were
provided with email accounts. For most molecular biologists this was something
entirely new, since most university labs were just beginning to be connected
with regular email service. At least 20 different bulletin boards on numerous topics
were supported by BIONET. In an
effort to change the culture of molecular biologists by accustoming them to the
use of electronic communications and more collaborative work, BIONET users were required to join one
of the bulletin board groups.
BIONET subscribers
had access to the latest versions of the most important databases for molecular
biology, including (i) GenBank™; (ii) EMBL, the European Molecular
Biology Laboratory nucleotide sequence library; (iii) NBRF‑PIR, the
National Biomedical Research Foundation's protein sequence database; (iv) SWISS‑PROT;
(v) VectorBank™, a database of cloning vector restriction maps and
sequences; (vi) Restriction Enzyme Library, a complete list of restriction
enzymes and cutting sites provided by Richard Roberts at Cold Spring Harbor;
and (vii) Keybank, IntelliGenetics' collection of predefined patterns or
"keys" for database searching. Several smaller databases were also
available, including a directory of molecular biology databases, a collection
of literature references to sequence analysis papers, and a complete set of
detailed molecular biological laboratory protocols (especially for E. coli and yeast work).[21]
Perhaps the most important
contribution made by BIONET to
establishing molecular biology as an information science was negotiated at the
renewal of the contract to manage GenBank in 1987. BB&N was 2 years behind
in publishing and disseminating sequence data it had received, making GenBank
about 20% out of date compared to other commercially available databases.[22]
Concerned that researchers would turn to other data sources, the NIH insisted
that IntelliGenetics solve the problem.[23] IntelliGenetics proposed to solve this
problem by automating the submission of gene and protein sequences. Instead of
laboriously searching the published scientific literature for sequence data,
rekeying them into a GenBank standard electronic format, and checking them for
accuracy, which was the standard method employed at that time, IntelliGenetics
automated the submission procedure with an online submission program, xgenpub (later called “authorin”).[24]
Creating a new culture requires
both the carrot and the stick. Making the online programs available and easy to
use was one thing. Getting all molecular biologists to use them was another. In
order to doubly encourage molecular biologists to comply with the new procedure
of submitting their data online, the major molecular biology journals agreed to
require evidence that the data had been submitted before they would consider a
manuscript for review. Nucleic Acids
Research was the first journal to enforce this transition to electronic
data submission.[25] With these
new policies and networks in place, BIONET
was able to reduce the time from submission to publication and distribution of
new sequence data from two years to 24 hours. As noted above, just a few years
earlier, at the beginning of BIONET,
there were only 10 million base pairs published, and these had been the result
of several years’ effort. The new electronic submission of data generated 10
million base pairs a month.[26]
Walter Gilbert may have angered some of his colleagues at the 1987 Los Alamos
Workshop on Automation in Decoding the Human Genome when he stated that,
“Sequencing the human genome is not science, it is production.”[27]
But he surely had his finger on the pulse of the new biology.
The
explosion of data on all levels of the biological continuum made possible by
the new biotechnologies and represented powerfully by organizations such as BIONET was a
source of both exhilaration and anxiety. Of primary concern to many biologists
was how best to organize this massive outpouring of data in a way that would
lead to deeper theoretical insight, perhaps even a unified theoretical
perspective for biology. The National Institutes of Health were among those
most concerned about these issues, and they organized a series of workshops to
consider the new perspectives emerging from recent developments. The meetings
culminated in a report by a panel chaired by Harold Morowitz entitled Models for Biomedical Research: A New Perspective (1985). The
panelists foresaw the emergence of a new theoretical biology “different from
theoretical physics, which consists of a small number of postulates and the
procedures and apparatus for deriving predictions from those postulates.” The
new biology was far more than just a collection of experimental observations.
Rather it was conceived as a vast array of information gaining coherence
through organization into a multi-dimensional matrix of biological knowledge,[28]
the complete data base of published biological experiments structured by the
laws, empirical generalizations, and physical foundations of biology and
connected by all the interspecific transfers of information.[29]
Moreover this matrix of biological knowledge would be tied to the use of
computers, which would be required to deal with the vast amount and complexity
of the information.[30]
In its
Long Range Plan of 1987 the Board of Regents of the National Library of
Medicine further elaborated on the notion of the matrix of biological knowledge
explicitly in terms of fashioning the new biology as an information science.[31]
In the view of the panel, the field of molecular biology was opening the door
to an era of unprecedented understanding and control of life processes,
particularly through the “automated methods now available to analyze and modify
biologically important macromolecules.”[32]
Due to the complexity of biological systems, basic research in the life
sciences would be increasingly dependent on automated tools to store and
manipulate the large bodies of data describing the structure and function of
important macromolecules. According to the NIH, because of new automated
laboratory methods, genetic and biochemical data are accumulating far faster
than they can be assimilated into the scientific literature. The problems of
scientific research in biotechnology, the NIH stated, are increasingly problems
of information science.[33]
To support
and promote the entry into the new age of biological knowledge the NIH
recommended building a National Center for Biotechnology Information to serve
as a repository and distribution center for this growing body of knowledge and
as a laboratory for developing new information analysis and communications
tools essential to the advance of the field. The proposal recommended $12.75
million per year for 1988-1990, with an additional $10 million per year for work in
medical informatics.[34]
The program would emphasize collaboration between computer and information
scientists and the biomedical researcher. In addition the NIH would support
research in the areas of molecular biology database representation,
retrieval-linkages, and modeling systems, while examining interfaces based on
algorithms, graphics and expert systems. The recommendation also called for the
construction of online data delivery through linked regional centers and
distributed database subsets.
The recent
explosive growth of sequencing
data, which first became available in university and company databases and
more recently publicly through the Human Genome Initiative, has produced a
paradigm shift in both the intellectual and institutional structures of
biology. According to some of the central players in this transformation, at
the core is biology’s switch from having been an observational science, limited
primarily by the ability to make observations, to being a data-bound science
limited by its practitioners’ ability to understand large amounts of
information derived from observations. To understand the data the tools of
information science have not only become necessary handmaidens to theory: they
have also fundamentally changed the picture of biological theory itself. To use
this flood of knowledge, which will pour across the computer networks of the
world, biologists not only must become computer-literate, but also change their
approach to the problem of understanding life. Walter Gilbert characterizes the
situation sharply:
The next tenfold increase in the amount of
information in the databases will divide the world into haves and have-nots,
unless each of us connects to that information and learns how to sift through
it for the parts we need.[35]
Gilbert goes on to describe the newly
forming genomic view of biology:
The new paradigm now emerging is that all
the “genes” will be known (in the sense of being resident in databases
available electronically), and that the starting point of a biological
investigation will be theoretical. An individual scientist will begin with a
theoretical conjecture, only then turning to experiment to follow or test that
hypothesis. The actual biology will continue to be done as “small
science”—depending on individual insight and inspiration to produce new
knowledge—but the reagents that the scientist uses will include a knowledge of
the primary sequence of the organism, together with a list of all previous
deductions from that sequence.[36]

Genomics, computational biology, and
bioinformatics restructure the playing field of biology, bringing a
substantially modified toolkit to the repertoire of molecular biology skills
developed in the 1970s. Along with the biochemistry components, new skills are
now required, including machine learning, robotics, databases, statistics and
probability, artificial intelligence, information theory, algorithms, and graph
theory.[37]
Proclamations of the sort made
by Gilbert and other promoters of genomics may seem like hyperbole. But the
Human Genome Initiative and the information technology that enables it have
changed molecular biology in fundamental ways, and indeed may suggest similar
changes in store for other domains of science. The online DNA and protein
databases I have described have not just been repositories of information for
insertion into the routine work of molecular biology, and the software programs
discussed in connection with IntelliGenetics and GenBank are more than
retrieval aids for transporting that information back to the lab. As a set of
final reflections, I want to look in more detail at some ways this software has
been used to address the problems of molecular biology in order to gain a sense
of the changes taking place.
A dramatic illustration of how
sequence alignment tools can be brought to bear on determining function and
structure is provided by the case of cystic fibrosis. Cystic fibrosis is caused
by aberrant regulation of chloride transport across epithelial cells in the
pulmonary tree, the intestine, the exocrine pancreas, and apocrine sweat
glands. This disorder was identified as due to defects in the cystic fibrosis
transmembrane conductance regulator protein (CFTR). The CFTR gene was isolated
in 1989, and subsequently identified as producing a chloride channel whose
activity depends on phosphorylation of particular residues within the
regulatory region of the protein. Using computer-based sequence alignment tools
of the sort described above, it was established that a consensus sequence for
nucleotide-binding folds that bind ATP is present near the regulatory region
and that 70 percent of cystic fibrosis mutations are accounted for by a
three-base-pair deletion that removes a phenylalanine residue within the first
nucleotide-binding domain. A significant portion of the remaining cystic
fibrosis mutations affect a second nucleotide-binding domain near the
regulatory region.[38]
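The notion of a consensus sequence can be made concrete with a short sketch. The Python fragment below scans a protein string for a pattern of the kind associated with ATP-binding folds. The pattern used here, G-x(4)-G-K-[ST] (commonly called the Walker A or P-loop motif), the function name find_motif, and the toy sequence are all assumptions introduced for illustration; none of them is taken from the studies cited, and the sketch is not a reconstruction of the software those studies used.

```python
import re

# Walker A (P-loop) consensus in PROSITE-like notation: G-x(4)-G-K-[ST].
# Choosing this particular pattern is an assumption made for illustration.
WALKER_A = re.compile(r"G.{4}GK[ST]")


def find_motif(protein, pattern=WALKER_A):
    """Return (1-based start position, matched residues) for each hit."""
    return [(m.start() + 1, m.group()) for m in pattern.finditer(protein)]


# Hypothetical toy sequence in one-letter amino acid code (not real data).
toy_protein = "MKLLAGPSGSGKSTLLMAIF"

hits = find_motif(toy_protein)
print(hits)
```

A scan of this kind only reports where a sequence matches a known pattern; establishing that the matching region actually binds a nucleotide still depends on structural and experimental evidence of the sort discussed in the text.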
In working out the folds and
binding domains for the CFTR protein, Hyde, Emsley, Hartshorn, et al. (1990)
used sequence alignment methods similar to those available in early models of
the IntelliGenetics software suite.[39]
In 1992 IntelliGenetics introduced BLAZE,
an even more rapid search program running on a massively parallel computer. As
an example of how computational genomics can be used to solve
structure-function problems in molecular biology, Brutlag repeated the CFTR
case using BLAZE.[40]
A sequence similarity search compared the CFTR protein to more than 26,000
proteins in a protein database of more than 9 million residues, resulting in a
list of 27 top similar proteins, all of which strongly suggested the CFTR
protein is a membrane protein involved in secretion. Another feature of the
comparison result was that significant homologies were shown with ATP-binding
transport proteins, further strengthening the identification of CFTR as a
membrane protein. The search algorithm identified two consensus sequence motifs
in the protein sequence of the cystic fibrosis gene product that corresponded
to the two sites on the protein involved in binding nucleotides. The search
also turned up distant homologies between the CFTR protein and proteins of
E. coli and yeast. The entire search took three hours. Such examples offer
convincing evidence that tools of computational molecular biology can lead to
the understanding of protein function.
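What a sequence similarity search does can be sketched in a few lines. The Python fragment below scores a query against every entry of a small database using Smith-Waterman local alignment, a standard rigorous method, and ranks the entries by score. It is offered only as an illustration of the general technique: the text does not specify which algorithm BLAZE or the IntelliGenetics suite ran, and the scoring values, the function names, and the three toy "proteins" are hypothetical.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty).

    The scoring values are illustrative assumptions, not a published
    substitution matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a fresh local alignment
                prev[j - 1] + step,  # align the two residues
                prev[j] + gap,       # gap in sequence b
                curr[j - 1] + gap,   # gap in sequence a
            )
            best = max(best, curr[j])
        prev = curr
    return best


def rank_database(query, database):
    """Score the query against every entry and sort, best match first."""
    scored = [(local_alignment_score(query, seq), name)
              for name, seq in database.items()]
    return sorted(scored, reverse=True)


# Hypothetical toy database; real searches covered thousands of proteins.
toy_db = {
    "prot_A": "MKGSGKSTLL",
    "prot_B": "WWWWPPPP",
    "prot_C": "GSGKT",
}

ranking = rank_database("GSGKST", toy_db)
print(ranking)
```

Because every query residue is compared with every database residue, the work grows with the product of query length and database size, which is why searches of this rigor were natural candidates for massively parallel hardware.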
The methods for analyzing
sequence data discussed above were just the beginnings of an explosion of
database mining tools for genomics that is continuing to take place.[41]
In the process, biology is becoming even more aptly characterized as an
information science.[42]
Advances in the field have led to large-scale automation of sequencing in
genome centers employing robots. In order to keep pace with this flood of data
emerging from automated sequencing, genome researchers have in turn looked
increasingly to artificial intelligence, machine learning, and even robotics in
developing automated methods for discovering patterns and protein motifs from
sequence data. Many molecular biologists
who welcomed the Human Genome Initiative with open arms undoubtedly believed
that when the genome was sequenced everyone would return to the lab to conduct
their experiments in a business-as-usual fashion, empowered with a richer set
of fundamental data. The developments in automation, the resulting explosion of
data, and the introduction of tools of information science to master this data
have changed the playing field forever: in the words of genome scientist Hans
Lehrach, there may be no “lab” to return to. In its place is a workstation
hooked to a massively parallel computer, producing simulations by drawing on
the data streams of the major databanks and carrying out “experiments” in silico rather than in vitro. Elizabeth Eisenstein, Bruno
Latour, and Adrian Johns have argued that a pre-condition for science as we
know it is the elaborate apparatus and organization of practice for
transcribing nature into a form compatible with institutions of the letter; and
in his own work on a center for molecular biology research, Latour argued--now
famously--that modern scientific laboratories are in effect inscription devices
and that the din of their operation is carefully hidden and ultimately silenced
in the production of scientific facts.[43]
I have argued that the fusion of the laboratory and contemporary forms of
computer-mediated communication offers a new--perhaps final--twist to this
position by the erasure of the wet lab from the academy altogether. The result
of biology’s metamorphosis into an information science just may be the
relocation of the lab to the industrial park and the dustbin of history.
Endnotes
[1] Bruno Latour, "How to Be Iconophilic in Art, Science and Religion?," in Caroline A. Jones and Peter Galison, eds., Picturing Science, Producing Art, New York; Routledge, 1998, pp. 418-440, especially pp. 424-428; Bruno Latour, "Drawing Things Together," in Michael Lynch and Steve Woolgar, eds., Representation in Scientific Practice, Cambridge, Mass.; MIT Press, 1990, pp. 19-68. Also see Bruno Latour, Science in Action: How to Follow Scientists and Engineers Through Society, Cambridge, Mass.; Harvard University Press, 1987.
[2] Adrian Johns, The Nature of the Book: Print and Knowledge in the Making, Chicago; University of Chicago Press, 1998, p. 19.
[3] Elizabeth L. Eisenstein, The Printing Press as an Agent of Change, 2 vols. (Cambridge; Cambridge University Press, 1979), Vol. 2, p. 703.
[4] See Christian B. Anfinsen, “Principles that Govern the Folding of Protein Chains,” Science 181 (Number 4096) (1973): 223-230, which discusses the work for which he was awarded the Nobel Prize for Chemistry in 1972: “This hypothesis (the “thermodynamic hypothesis”) states that the three-dimensional structure of a native protein in its normal physiological milieu…is the one in which the Gibbs free energy of the whole system is lowest; that is, that the native conformation is determined by the totality of interatomic interactions and hence by the amino acid sequence, in a given environment.” (p. 223)
[5] An indicator of the difficulty of pursuing this approach alone is suggested by the growth of databanks of atomic coordinates for proteins. The Protein Data Bank (PDB) was established in 1971 as a computer-based archival resource for macromolecular structures. The purpose of the PDB was to collect, standardize, and distribute atomic coordinates and other data from crystallographic studies. In 1977 the PDB listed atomic coordinates for 47 macromolecules. In 1987 that number began to increase rapidly at a rate of about 10 percent per year due to the development of area detectors and widespread use of synchrotron radiation, so that by April 1990 atomic coordinate entries existed for 535 macromolecules. Commenting on the state of the art in 1990, Holbrook et al. noted that crystal determination could require one or more man-years. (Bernstein, F. C., T. F. Koetzle, et al. (1977). “The Protein Data Bank: A computer based archival file for macromolecular structure.” Journal of Molecular Biology 112: 535-542.) Currently (1999) the Biological Macromolecule Crystallization Database (BMCD) of the Protein Data Bank contains entries for 2526 biological macromolecules for which diffraction quality crystals have been obtained. These include proteins, protein:protein complexes, nucleic acid, nucleic acid:nucleic acid complexes, protein:nucleic acid complexes, and viruses. (Holbrook, S. R., S. M. Muskal, et al. (1993). “Predicting Protein Structural Features with Artificial Neural Networks.” Artificial Intelligence and Molecular Biology. L. Hunter, ed. Menlo Park, CA, AAAI Press: 161-194.)
[6] D. Brutlag, “Understanding the Human Genome,” Scientific American Introduction to Molecular Medicine. P. Leder, D. A. Clayton and E. Rubenstein, eds. New York, NY, Scientific American, Inc., 1994: p. 159.
[7] E. A. Feigenbaum and N. Martin, Proposal: MOLGEN - a computer science application to molecular genetics, Heuristic Programming Project, Stanford University, Technical Report No. HPP-78-18, 1977.
[8] P. Friedland, Knowledge-Based Experiment Design in Molecular Genetics. Ph.D. Thesis, Computer Science, Stanford University, Stanford, 1979.
[9] E. A. Feigenbaum, B. Buchanan, et al., A Proposal for Continuation of the MOLGEN Project: A Computer Science Application to Molecular Biology, Computer Science Department, Stanford University, Heuristic Programming Project, Technical Report No. HPP-80-5, April 1980, Section 1, p. 1.
[10] Douglas Brutlag, personal communication. Peter Friedland, personal communication. After his work on MOLGEN and at IntelliGenetics (discussed below) Friedland went on to become chief scientist at the NASA-Ames Laboratory for Artificial Intelligence in 1987.
[11] L.J. Korn, C.L. Queen, and M.N. Wegman. “Computer Analysis of Nucleic Acid Regulatory Sequences.” Proceedings of the National Academy of Sciences 74 (1977): 4516-4520; R. Staden. “Sequence Data Handling by Computer.” Nucleic Acids Research 4 (1977): 4037-4051; R. Staden. “Further Procedures for Sequence Analysis by Computer.” Nucleic Acids Research 5 (1978): 1013-1015; R. Staden. “A Strategy of DNA Sequencing Employing Computer Programs.” Nucleic Acids Research 6 (1979): 2602-2610.
[12] P. Friedland, D.L. Brutlag, J. Clayton, and L.H. Kedes. “SEQ: A Nucleotide Sequence Analysis and Recombinant System.” Nucleic Acids Research 10 (1982): 279-294.
[13] T. Rindfleisch, P. Friedland, and J. Clayton. The GENET Guest Service on SUMEX, SUMEX-AIM Report, 1981: Stanford University Special Collections, Friedland Papers, Fldr GENET.
[14] Doug Brutlag, Personal Communication, 6/19/99. Also discussed in the official site review for BIONET conducted by the NIH Special Study Section, March 17-19, 1983, “BIONET, National Computer Resource for Molecular Biology,” Stanford University Special Collections, Brutlag Papers, p. 2. Also discussed in Roger Lewin, “National Networks for Molecular Biologists,” Science 223 (1984): 1379-1380.
[15] This was announced to the GENET community by Allan Maxam, the chairman of the national advisory board. See: Allan M. Maxam to GENET Community. Subject: Closing of GENET: August 23, 1982. Stanford University Special Collections, Peter Friedland Papers, Fldr GENET.
[16] Business Plan for IntelliGenetics, May 8, 1981, p. 5. Stanford Special Collections, Brutlag Papers, Fld IntelliGenetics. Emphasis in the original. Details of the software licensing arrangement and the revenues generated are discussed in a letter to Niels Reimers, Stanford Office of Technology Licensing on the occasion of renegotiating the terms. See: Peter Friedland to Niels Reimers. Subject: Software Licensing Agreement: April 2, 1984. Stanford University Special Collections, Fldr IntelliGenetics.
[17] Lewin noted that this was the largest award of its kind by NIH to a for-profit organization. See ibid., p. 1380.
[18] Minutes of the Meeting of the National Advisory Committee for BIONET, March 23, 1985 (Final version prepared 1 August 1985), p. 4. In Stanford University Special Collections, Brutlag Papers, Fld. BIONET.
[19] Joel Huberman. “BIONET: Computer Power for the Rest of Us.” (1989): p. 1.
[20] Peter Friedland, "BIONET Organizational Plans," 27 April 1984, Company Confidential Memo. Stanford University Special Collections, Brutlag Papers, Fldr BIONET, p. 1. A published version of these objectives appeared as: Dennis H. Smith, Douglas Brutlag, Peter Friedland, and Laurence H. Kedes, “BIONET™: national computer resource for molecular biology,” Nucleic Acids Research, 14(1)(1986): 17-20.
[21] IntelliGenetics, Introduction to Bionet™: A Computer Resource for Molecular Biology, User manual for Bionet subscribers, Release 2.3, Mountain View, CA, IntelliGenetics, 1987, p. 23, “Databases available on BIONET.”
[22] Douglas Brutlag, Personal communication, June 19, 1999. Steve Boswell, “Los Alamos Workshop—Exploring the Role of Robotics and Automation in Decoding the Human Genome,” IntelliGenetics trip report, January 9, 1987. In Stanford Special Collections, Brutlag Papers, Fld. BIONET.
[23] Barbara H. Duke, Contracting Officer, NIH, to IntelliGenetics, Inc. "Request for Revised Proposal in Response to Request for Proposals RFP No. NIH-GM-87-04 entitled ‘Nucleic Acid Sequence Data Bank,’" June 3, 1987, Letter with attachment. Stanford Special Collections, Brutlag Papers, Fld. BIONET.
[24] Douglas L Brutlag, and David Kristofferson. “BIONET: An NIH Computer Resource for Molecular Biology.” Biomolecular Data: A Resource in Transition. Ed. R. R. Colwell. Oxford: Oxford University Press, 1988. 287-294. Also see, “Automatic Data Submission to GenBank, EMBL, and NBRF-PIR,” BIONET News, Vol 1, No. 1, April 1988.
[25] Ibid.
[26] Douglas Brutlag, Personal Communication, June 19, 1999. See nomination for Smithsonian-Computerworld Award in Stanford Special Collections, Brutlag Papers, Fld. Smithsonian Computerworld Award.
[27] Quoted from Steve Boswell, “Los Alamos Workshop—Exploring the Role of Robotics and Automation in Decoding the Human Genome,” IntelliGenetics trip report, January 9, 1987, p. 2. In Stanford Special Collections, Brutlag Papers, Fld. BIONET.
[28] H. Morowitz, Models for Biomedical Research: A New Perspective. Washington, D.C., National Academy of Sciences Press, 1985, p. 21.
[29] Ibid., p. 65.
[30] Ibid., p. 67.
[31] Board of Regents, NLM Long Range Plan (Report of the Board of Regents), Bethesda, MD, National Library of Medicine, (1987).
[32] Ibid., p.26.
[33] Ibid., p. 29.
[34] Ibid., pp. 46-47. The figures for Medical Informatics were $7.4, $9.9, and $13 million for 1988-90.
[35] Walter Gilbert, “Towards a Paradigm Shift in Biology,” Nature, 349 (1991), p. 99.
[36] Ibid.
[37] These are the disciplines graduate students and postdocs in molecular biology in Brutlag’s lab at Stanford are expected to work with. Source: Douglas Brutlag, “Department Review: Bioinformatics Group, Department of Biochemistry, Stanford University, 1998,” personal communication.
[38] S. C. Hyde, P. Emsley, et al. (1990). “Structural Model of ATP-binding Proteins Associated with Cystic Fibrosis, Multidrug Resistance and Bacterial Transport.” Nature 346: 362-365; B. S. Kerem, J. M. Rommens, et al. “Identification of the Cystic Fibrosis Gene: Genetic Analysis,” Science 245(1989): 1073-1080; B. S. Kerem, J. Zielenski, et al. “Identification of Mutations in Regions Corresponding to the Two Putative Nucleotide (ATP)-Binding Folds of the Cystic Fibrosis Gene,” Proceedings of the National Academy of Sciences 87(1990): 8447-8451; J. R. Riordan, J. M. Rommens, et al., “Identification of the Cystic Fibrosis Gene: Cloning and Characterization of Complementary DNA,” Science 245(1989): 1066-1073.
[39] Hyde, Emsley, et al. used the Chou-Fasman algorithm (1973) for identifying consensus sequences and the Quanta™ modeling package produced by Polygen Corp., Waltham, Mass. for modeling the protein and its binding sites. See S. C. Hyde, P. Emsley, et al. (1990). “Structural Model of ATP-binding Proteins Associated with Cystic Fibrosis, Multidrug Resistance and Bacterial Transport.” Nature 346: 362-365.
[40] D. Brutlag, “Understanding the Human Genome” Scientific American Introduction to Molecular Medicine. P. Leder, D. A. Clayton and E. Rubenstein, Eds. New York, NY, Scientific American, Inc., 1994: pp. 164-166.
[41] See for instance the National Institute of General Medical Sciences (NIGMS), “Protein Structure Initiative Meeting Summary,” April 24, 1998, at:
http://www.nih.gov/nigms/news/reports/protein_structure.html
[42] I have focused on the development of software in this discussion. But a further crucial stimulus to the takeoff of bioinformatics, of course, has been hardware and networking developments. The growth of databases and complexity of the searches that were to be undertaken stimulated the demand for faster algorithms, more powerful computer systems, and network bandwidth. At the beginning of this “bioinformatics revolution” in the 1970s, for example, a search on a DNA sequence of typical size would be performed by a computer capable of performing one million instructions per second (one MIPS) and would take approximately 15 minutes. Throughout the late 1970s and 1980s mini-computers and personal computer workstations continued to increase in power at about the same rate as the growth of the databases, so that a typical search still took around 15 minutes. By the end of the 1980s, however, the growth in sequence data—now hundreds of megabytes in size—had overtaken the ability of computers to search it with acceptable turnaround time. Shortcut search methods and more efficient code helped, but the most rigorous and sensitive searches began to require hours of computing time to align and score even a single query sequence against a database of sequences. The NIH and NSF responded to the challenge by supporting research and development of new computer architectures, regional supercomputer centers and several large-scale computing initiatives. (See Thomas P. Hughes, et al., eds., Funding a Revolution: Government Support for Computing Research, Washington, D.C., National Academy Press, 1999.) Commercial vendors such as DEC, SUN Microsystems, Cray Computers, and MasPar Computer Corporation tried to meet the large-scale computing needs of geneticists with, for example, massively parallel computers, such as the MasPar MP-1 computer.
In early 1992, the MasPar MP-1104 with 4,096 processors could search the entire Swiss-Protein database in 30 seconds with a query of 100 amino acids, and a query of 1000 amino acids could be executed on the GenBank database (74,000 sequences) in 15 minutes. (See IntelliGenetics, Inc., and MasPar Computer Corporation, “BLAZE: A Massively Parallel Sequence Similarity Search Program for Molecular Biologists,” Product Information Bulletin, May 1992.)
[43] Bruno Latour and Steve Woolgar, Laboratory Life: The Construction of Scientific Facts, Princeton; Princeton University Press, 1979, second edition, 1986.