Accepting the Gordium challenge

Timothy D. Fenn

version 1.0 (Mon Sep 20 13:50:13 PDT 2004)

1. The Gordian knot

Crystallography, to many outside its realm, is considered an esoteric branch of science. Each step in the process of solving an atomic structure of a particular molecule (regardless of size) cannot be simply described - it is therefore typically left as a vaguely defined phenomenon (usually following something along the lines of "grow crystals, obtain a diffraction pattern using X-rays and use that pattern to generate an atomic model") to be understood only by the trained few.

The curiosity left by these sorts of definitions is not helped by the manner in which most of the experiment is carried out - every step following crystal growth is handled by computer-driven processes. This raises the level of abstraction in crystallography to a point where even crystallographers themselves are unsure of some of the mathematics and statistics involved.

This has led to a field that is very powerful in answering questions regarding biologically relevant topics, but also to a form of laissez faire science - the methods used to achieve the end (i.e. atomic structure) are irrelevant, just so long as those methods work (on a reasonable time scale with little user intervention) within the context of the scientific questions asked. In terms of software design, the repercussions of this approach are especially felt in a field so intertwined with computers, as many crystallographic programs are closed source efforts but are widely used as they "get the job done." This appears to be in direct violation of the scientific notion that all information relevant to the experiment (in this case, the source code upon which the software is based) is shared, thereby restricting the capacity for peer review and perhaps calling into question the warrant under which some crystallographic programs operate.

Given this, is the closed source approach the most appropriate one for the crystallographic community? If so, have we violated any of the fundamental mantras (e.g. necessity of peer review and the heuristic nature of research) of science in the process? On the other hand, how can/should a software development process be sustained that allows one to define/protect one's own research initiatives (or profitability) in the face of open source development? This essay is an attempt to address how one can begin to answer these questions, and use these answers to develop aphorisms involving scientific programming. It should be noted that although many of the examples used here are in terms of structural biology/crystallography (a result of my own naive training), the ideas expressed here should be extensible to academic software development as a whole - and that much of what is stated in this document is simply a reiteration (albeit in a different form) of what people far smarter than myself (mostly Eric Raymond, but also Edward O. Wilson, Frederick Brooks, Alan Cox and others) have offered on like topics.

2. Should scientific software be closed source?

While I won't go into detail regarding open vs. closed source software development in general (please read either [1] or [2] as an excellent basis for this), it is sufficient to say there are several discriminators involved that make an open source venture more appropriate (from [2]):

  a. reliability/stability/scalability are critical
  b. correctness of design and implementation cannot readily be verified by means other than independent peer review
  c. the software is critical to the user's control of his/her business/research
  d. the software establishes or enables a common computing and communications infrastructure
  e. key methods (or functional equivalents of them) are part of common engineering knowledge

Each of these points can be considered on a per-program basis, although I will be so bold as to propose (b) as the most important in the case of scientific programs in general. The reason I say so is that it is fundamentally linked to the following (partial) definition of science as a whole:

"The diagnostic features of science that distinguish it from pseudoscience are first, repeatability: The same phenomenon is sought again, preferably by independent investigation, and the interpretation given to it is confirmed or discarded by means of novel analysis and experimentation." [3]

This implies that any scientific program should be open to verification by independent investigation, but the closed source approach precludes this (as it is the complete antithesis of (b)). Therefore, if results from two programs (which may or may not use the same mathematical principles to achieve their respective ends) differ, the closed source strategy makes this problem a difficult one to solve (how do you reconcile the differences in the absence of understanding?). Further, this leaves scientists in the position of defending results on the basis of programs that may only be applicable in certain instances (instances which may be unknown to the user).

The molecular surface program GRASP (Graphical Representation and Analysis of Structural Properties) illustrates this sort of problem [4]. GRASP generates molecular surfaces from atomic coordinates, which can then be colored according to particular atomic properties such as electrostatic charge. Generating this electrostatic field requires solving an equation that takes several parameters - parameters which are dependent on the environment at hand. In the absence of source code (as is the case with GRASP), the parameters to the equation cannot be altered to suit non-standard cases. This makes interpretation a tricky enterprise, and further makes conclusions based on these results a matter of caveat emptor - clearly a bad thing in a scientific world in search of immutability.
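
To make the dependence on parameters concrete, here is a minimal sketch in Python. It is not GRASP's method and does not reproduce its equation; it uses a simple screened Coulomb potential purely as a stand-in. The function name, the parameter names (dielectric, debye_length) and the default values are all assumptions chosen for illustration.

```python
import math

# Approximate conversion factor so that charges in units of e and distances
# in Angstrom give a potential in kcal/(mol*e). Illustrative value only.
COULOMB_CONSTANT = 332.06


def screened_coulomb_potential(charges, point, dielectric=80.0, debye_length=None):
    """Potential at `point` from a set of point charges.

    charges      -- iterable of (x, y, z, q) tuples (Angstrom, units of e)
    point        -- (x, y, z) position at which to evaluate the potential
    dielectric   -- relative dielectric constant of the medium (assumed uniform)
    debye_length -- screening length in Angstrom, or None for no ionic screening

    This is a hypothetical stand-in for a real electrostatics solver.
    """
    total = 0.0
    for x, y, z, q in charges:
        r = math.dist((x, y, z), point)
        if r == 0.0:
            raise ValueError("evaluation point coincides with a charge")
        term = COULOMB_CONSTANT * q / (dielectric * r)
        if debye_length is not None:
            # Exponential damping models screening by mobile ions.
            term *= math.exp(-r / debye_length)
        total += term
    return total


if __name__ == "__main__":
    unit_charge = [(0.0, 0.0, 0.0, 1.0)]
    probe = (0.0, 0.0, 10.0)
    # Same charge, same probe position, two assumed environments.
    in_water = screened_coulomb_potential(unit_charge, probe, dielectric=80.0)
    in_protein = screened_coulomb_potential(unit_charge, probe, dielectric=4.0)
    print(in_water, in_protein)
```

With the parameters exposed, the same charge at the same distance gives potentials that differ by a factor of 20 between the two assumed dielectric constants (80 and 4). A program that fixes such values internally gives the user no way to know, or change, which environment the result describes.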

However, many other aspects of GRASP are easily verified without access to the source code, as the program can also color the surface according to curvature. It is therefore only a matter of close inspection to verify whether the resultant values make sense in the context of the system.

As implied with GRASP, it seems that as long as the calculation or process carried out by the software can be inspected in some fashion, then the closed source model can still be applied. For example, visualizing and altering atomic models can be done with such programs as O [5], which simply require a graphical application to display the molecule and allow commands which modify it so that it agrees better with the experimental observations (such as the measurements derived from crystallographic analysis). In this sense, the closed source argument is allowable (and followed in the case of O), as condition (b) is reduced to a trivial one.

However, the definition of science does not stop at repeatability alone. Dr. Wilson further states the importance of heuristics:

"The best science stimulates further discovery, often in unpredictable new directions; and the knowledge provides an additional test of the original principles that led to its discovery." [3]

which, when combined with repeatability, reads like a scientific version of the following open source observation (with my own supplementary information added in brackets):

"Good programmers [scientists] know what to write [research]. Great ones know what to rewrite [retest] (and reuse [further study])." [1]

Put another way, although (b) may be the most important condition for open source scientific software based on general scientific principles, (c) plays an increasingly important role as other scientists interested in derivative approaches wish to implement their own ideas. An open source model provides an excellent basis to do so, as it does not require the scientist to start entirely from scratch every time a "new direction" is sought.

To go back to the O example, visualizing crystallographic data is necessary when making decisions about how to move/rotate/remove/add parts of a molecule during the experimental process. O implements an algorithm that was originally designed in the late 1960s for the purpose of visualization (a fact which can only be inferred by analyzing the visual results) - however, many other more modern (and potentially useful) approaches are beginning to arise [6]. It would be of scientific interest to implement some of these more modern ideas using O as a springboard, but the closed source nature of O requires scientists to look elsewhere. This is a loss not just to the interested scientists, but to the O developers and crystallographers as a whole.

This brings up another beneficial point of open source design combined with scientific heuristics, but a tricky one to follow in practice:

"Provided the development coordinator has a communications medium at least as good as the Internet, and knows how to lead without coercion, many heads are inevitably better than one." [1]

In scientific terms, having more researchers contributing to a scientific program is a good thing, just so long as the initiative is properly coordinated. This not only leads to more productive and scientifically useful programs (in turn generating a larger user base), but further enhances the reputation of those involved (not just in the context of the hacker subculture [7], but in terms of scientific renown).

This concept is best captured by several current crystallographic collaborations (e.g. cctbx, CCP4, CNS, etc.), including the only one with which I have personal experience - POVScript+. The consummating event in any scientific endeavor is publication of the results in a peer reviewed journal. In the case of crystallographic structures, the resultant atomic model is displayed in a fashion that details its more salient features. As with any display, the most aesthetically pleasing one is desired. However, in the case of science, it may be rather difficult to combine aesthetics with appropriate scientific content. This is typically a result of limited programs and/or conceptualization problems - e.g. "How do I present a six dimensional property in a way that looks good yet gets the information across on a two dimensional piece of paper?" I was faced with this problem after determining a rather detailed structure of a molecule - I had several aspects of the molecule and the crystallographic data that I wanted to present in one figure (for the gearheads reading this, I wanted to illustrate atomic disorder as defined by refinement of anisotropic displacement tensors to show how well they agreed with the electron density maps). This proved an overly difficult task given the available programs, and even when I did manage to make some images I wasn't pleased with the results.

What to do? I wanted to publish my work, but I couldn't even find a way to present it adequately (on a tangential note, this reminds me of the following: "Every good work of software starts by scratching a developer's personal itch" [1]). Rather than designing an entire software package, I took one of the existing programs (MolScript [8]) that had the source code available, and modified it to do what I wanted [6]. Largely thanks to the design of Per Kraulis (the author of MolScript) and its open source foundation, my implementation only took a few weeks of spare time to complete. In the end, I got the results I was looking for [9] and an additional publication based on my modifications to MolScript [6], and Per Kraulis is all the better for it - although I certainly don't claim my modification to MolScript is popular, it can only increase Per Kraulis' user base and reputation as a scientist.

3. Freeing the ox cart

In the face of an open source model, scientific programming is hampered by two questions:

  1. If I go open source, how can I make money?

The first question is most often asked by advocates of university technology transfer programs and commercial entities that typically think of scientific programs as a sale value based good. It has been demonstrated elsewhere that this assumption is false (the majority of development involves code maintenance, and thus software should be considered a use value based good), and the point will not be belabored here [2]. Further, it does no good to differentiate amongst the user base to decide if a sale value is justified (i.e. "free for academic and non-profit"), as users are then placed in a group to which they may not remain confined (e.g. if I move from an academic to commercial position, can I write my own software using any of the things I learned from reading your source?).

Suffice it to say the sale value based model is problematic, and is hard to define in a clear-cut manner. The best (open source based) model is one in which the source is freely available (thus avoiding the quagmire of what is a derivative product, who owns or deserves payment for the product or any derivative work, etc.) and services are relied upon as the revenue generator. This is epitomized by developers selling support contracts that cover periods of time (e.g. Red Hat Linux), which can cover anything from patches/updates to technical support (potentially including questions regarding the code itself). Alternatively, services can be provided as subscriptions to a web-based interface of the program, or even through the sale of CDs/documentation/manuals.

Of course, there are other possibilities to turning a profit, mostly dependent upon the tool involved. For further examples of such "indirect sale-value" models, see http://www.catb.org/~esr/writings/magic-cauldron/magic-cauldron-9.html, part of [2].

Regardless of the benefits of an open source model, concerned experimenters may not want to reveal the inner workings of their software as it represents the basis of their research. This leads to question 2:

  2. How do I protect my own research?

The roots of this question lie in the concern that research should be an independently developed and thought out venture, and only when the task is completed should results be shared. Any deviation from this track could lead to a competitor scooping the same topic.

There are two methods by which this can be avoided in a simple manner:

Of the two, open source advocates would certainly prefer the former, as the latter could be turned around as such: the hide-the-ball model forgets the Frederick Brooks notion that "more users find more bugs" (also referred to as Linus's law), as discussed above [1, 10]. However, cautious men of science realize any program lacking rigorous testing should always bear the warning: "garbage in - garbage out."

Alexander the Great realized the Gordian knot was purely a problem of perception ("What does it matter how I loose it?") - this is also how the authors of the essays I have summarized here envision software development as a whole. Scientists are beginning to realize this through a concerted effort towards an open source foundation (e.g. http://www.openinformatics.org), but it is still under debate as to how this can/should be implemented (particularly given the complications of the Bayh-Dole Act (USC Title 35, Part II, Chap. 18, Sec. 200), which makes it an "objective of the Congress to use the patent system to promote the utilization of inventions arising from federally supported research or development"). In the meantime, it seems that the best approach to scientific programming is based on projects which scientists can not only build upon, but also use as a teaching tool for those outside its esoteric realm.

Bibliography

1. Raymond, E.S. The Cathedral and the Bazaar. http://www.catb.org/~esr/writings

2. Raymond, E.S. The Magic Cauldron. http://www.catb.org/~esr/writings

3. Wilson, E.O. 1998. Consilience. Vintage Books, New York.

4. Nicholls, A., Sharp, K. and Honig, B. 1991. Protein Folding and Association: Insights From the Interfacial and Thermodynamic Properties of Hydrocarbons. PROTEINS, Structure, Function and Genetics. 11(4), 281.

5. Jones, T.A., Zou, J.Y., Cowan, S.W. and Kjeldgaard, M. 1991. Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Cryst A47, 110-119.

6. Fenn, T.D., Ringe, D. and Petsko, G.A. 2003. POVScript+: A Program for Model and Data Visualization using Persistence of Vision Raytracing. J. Appl. Cryst. 36:944-947. (see also http://www.stanford.edu/~fenn/povscript)

7. Raymond, E.S. Homesteading the Noosphere. http://www.catb.org/~esr/writings

8. Kraulis, Per J. 1991. MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J. Appl. Cryst. 24:946-950.

9. Fenn, T.D., Ringe, D. and Petsko, G.A. 2004. Xylose Isomerase in Substrate and Inhibitor Michaelis States: Atomic Resolution Studies of a Metal-Mediated Hydride Shift. Biochemistry 43(21):6464-6474.

10. Brooks, F.P. 1995. The Mythical Man-Month. Addison-Wesley Longman, Inc., Reading, MA.