Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison

Time Warps, String Edits and Macromolecules is a young classic in computational science, scientific analysis from a computational perspective. The computational perspective is that of sequence processing, in particular the problem of recognizing related sequences. The book is the first, and still best compilation of papers explaining how to measure distance between sequences, and how to compute that measure effectively. This is called string distance, Levenshtein distance, or edit distance. The book contains lucid explanations of the basic techniques; well-annotated examples of applications; mathematical analysis of its computational (algorithmic) complexity; and extensive discussion of the variants needed for weighted measures, timed sequences (songs), applications to continuous data, comparison of multiple sequences and extensions to tree-structures. In molecular biology the sequences compared are the macromolecules DNA and RNA. Sequence distance allows the recognition of homologies (correspondences) between related molecules. One may interpret the distance between molecular sequences in terms of the mutations necessary for one molecule to evolve into another. A further application explores methods of predicting the secondary structure (chemical bonding) of RNA sequences. In speech recognition speech input must be compared to stored patterns to find the most likely interpretation (e.g., syllable). Because speech varies in tempo, part of the comparison allows for temporal variation, and is known as “time-warping”. In dialectology Levenshtein distance allows analysis of the learned variation in pronunciation, its cultural component. Levenshtein distance introduces a metric which allows more sophisticated analysis than traditional dialectology's focus on classes of alternative pronunciations. A similar application is the study of bird song, where degrees of distance in song are seen to correspond to the divergence of bird populations. A final application area is software, where Levenshtein distance is employed to located differing parts of different versions of computer files, and to perform error correction.

David Sankoff is a professor in the Department of Mathematics and Statistices at the University of Montreal. Joseph Kruskal works at the Mathematics and Statistics Research Center at Bell Laboratories

John Nerbonne is a professor at Alfa-informatica, BCN, University of Groningen.

- 1 An Overview of Sequence Comparison Joseph B. Kruskal
  - 1 Introduction
  - 2 How sequences differ
  - 3 Analysis of differences: trace, alignment, and listing
  - 4 Levenshtein and other distances
  - 5 Calculating distance: the basic algorithm
  - 6 Computational complexity of sequence comparisons
  - 7 How sequence comparison is used
  - 8 Why should distances be used?
  - 9 Why should Levenshtein be used?
  - References
I Macromolecular Sequences
- 1 Introduction
- 2 Advantages of the Dynamic Programming Method
- 3 Biological Background
- References
- 2 Recognition Patterns in Genetic Sequences Bruce W. Erikson and Peter H. Sellers
  - 1 Evolutionary distance and metric alignment
  - 2 Two pattern-recognition theorems
  - 3 Recognition of pattern s in sequence 1
  - 4 Recognition of a pattern in two DNA sequences
  - 5 A metric set of genetic distances between amino acids
  - 6 Recognition of a cyclically permuted pattern in two protein sequences
  - 7 Recognition of a potential translocation signal in a protein sequence
  - 8 Acknowledgements
  - References
- 3 Fast Algorithms to Determine RNA Secondary Structure Containing Multiple Loops David Sankoff, Joseph B. Kruskal, Sylvie Mainville, and Robert J. Cedergren
  - 1 Introduction
  - 2 Constraints on secondary structure
  - 3 Loops
  - 4 The thermodynamics of a structure
  - 5 Sequence comparison and self-comparison
  - 6 A recursion
  - 7 The basic algorithm
  - 8 An n³ algorithm
  - 9 An n⁴ algorithm
  - 10 A heuristic extension algorithm for longer sequences
  - References
II Time-Wrapping, Continuous Functions, and Speech Processing
- 1 Introduction
- References
- 4 The Symmetric Time-Wrapping Problem: From Continuous to Discrete Joseph B. Kruskal and Mark Liberman
  - 1 Introduction
  - 2 Time-wrapping in the continuous case
  - 3 Time-wrapping in the discrete case
  - 4 Distance and length in the continuous case
  - 5 Distance and length in the discrete case
  - 6 How compression-expansion differs from deletion-insertion
  - 7 Combining deletion-insertion and compression-expansion in a single method
  - 8 Interpolation between the sampling points
  - 9 Average of two trajectories
  - 10 Average of two sequences
  - References
- 5 Use of Dynamic Programming in a Syllable-Based Continuous speech Recognition System Melvyn J. Hunt, Matthew Lenning, and Paul Mermelstein
  - 1 Introduction
  - 2 Speech recognition as a string-matching problem
  - 3 Approaches to the problem
  - 4 The syllable-based system
  - 5 The syllable comparator
  - 6 System performance
  - 7 Conclusions
  - References
- 6 Application of Sequence Comparison to the Study of Bird Songs David W. Bradley and Richard A. Bradley
  - 1 Introduction
  - 2 Methods for coding the songs
  - 3 Determining inter-note distrances
  - 4 Algorithms for sequence comparison
  - 5 Comparative evaluationg of the methods
  - 6 Application of the method
  - 7 Discussion
  - 8 Conclusion
  - 9 Acknowledgements
  - References
    - Appendix 1 APL programs
    - Appendix 2 Statistical Methods
III Variations on a Theme: Algorithms for Related Problems
- 1 Introduction
- 2 Finding a pattern in a sequence
- 3 Finding similar portions in two sequences
- 4 Comparing several sequences
- 5 Comparing one sequence with itself
- 6 Comparing two trees
- 7 Comparing two directed networks
- 8 Comparing two continuous functions permitting compression and expansion
- 9 Permitting transpositions
- 10 Permitting generalized substituions
- 11 Comparing under constraints
- 12 Generalizing Levenshtein
- References
- 7 On the Complexity of the Extended String-to-String Correction Problem Robert A. Wagner
  - 1 Introduction
  - 2 The problem
  - 3 Cuts of traces
  - 4 Cellars and paths
  - 5 Subgraphs of G
  - 6 CELLAR algorithm
  - 7 Proof of the algorithm
  - 8 Special algorithms for special sets of w_x values
  - 9 An NP-complete ESSCP
  - 10 A generalization
  - References
- 8 An analysis of the General Tree-Editing Problem Andrew S. Noetzel and Stanley M. Selkow
  - 1 Introduction
  - 2 Definitions
  - 3 Insertions and deletions restricted to leaves
  - 4 An algorithm for tree distance when insertions and deletions are restricted to leaves
  - 5 The case of general insertions and deletions
  - 6 An algorithm for tree-edit distance in the case of general insertions and deletions
  - 7 Conclusions
  - References
- 9 Simultaneous Comparison of Three or More Sequences Related by a Tree David Sankoff and Robert J. Cedergren
  - 1 Introduction
  - 2 Tree alignments
  - 3 An algorithm to calculate tree distances
  - 4 A dynamic-programming algorithm for the inner minimization
  - References
- 10 An Anthology of Algorithms and Concepts for Sequence Comparison Joseph B. Kruskal and David Sankoff
  - 1 Directed networks
  - 2 Level-building in recognition of continuous speech
  - 3 Cutting corners to improve calculation time
  - 4 Deletion and insertion constraints
  - 5 A biological application of insertion and deletion constraints: 5S RNA from humans and E. coli
  - 6 Finding similar portions of two sequences
  - 7 Weights for insertion and deletion of consecutive strings
  - 8 Generalized substituions: Quartic-, cubic-, and quadratic-time algorithms
  - 9 Consistency and metric properties: some conditions
  - References
- 11 Dissimilarity Measures for Clustering Strings James M. Coggins
  - 1 Background
  - 2 String dissimilarity measures
  - 3 Improved string dissimilarity measures
  - 4 Examples
  - 5 Conclusion
  - References
IV Computational Complexity
- 1 Introduction
- 12 Recent Results on the Complexity of Common-Subsequence Problems Daniel S. Hirschberg
  - 1 Introduction
  - 2 NP-completeness results
  - 3 Polynomial algorithms
  - 4 Lower bounds
  - 5 Other algorithms for the 2-LCS problem
  - 6 An open problem
  - References
- 13 Formal-Language Error Correction Robert A. Wagner
  - 1 Introduction
  - 2 Error processing
  - 3 Optimal error correction
  - 4 Some results
  - 5 The context-sensitive cas and the maximum independent set problem
  - 6 A graph-theoretical formulation
  - References
- 14 How to Compute String-Edit Distances Quickly William J. Masek and Michael S. Paterson
  - 1 Introduction
  - 2 A faster algorithm
  - 3 Longest common subsequence
  - 4 Sparseness is necessary
  - 5 When to use the faster algorithm
  - 6 Conclusion
  - Acknowledge
  - References
V Random Sequences
- 1 Introduction
- References
- 15 An Upper-Bound Technique for Length Václav Chvatál and David Sankoff
  - 1 Introduction
  - 2 An unexpected invariant for length-n sequences
  - 3 A bound on the occurrences of highly similar sequences
  - 4 Calculating expected values
  - References
- 16 Probabilistic Behavior of Longest-Common-Subsequences Length Joseph Deken
  - 1 Probabilistic models for random sequences
  - 2 Upper bounds
  - 3 Upper bounds for c'(k, p, n)
  - 4 More extensive precedence rules
  - References
- 17 Common Subsequences and Monotone Subsequences David Sankoff and Sylvie Mainville
  - 1 Common subsequences
  - 2 Monotone subsequences
  - References
Author Index
Subject Index

October 2001

ISBN (Paperback): 1575862174 (9781575862170)
ISBN (electronic): 1575867117 (9781575867113)

Time Warps, String Edits, and Macromolecules

Contents