Naor, D. and Brutlag, D. L. (1994). On Near-Optimal Alignments of Biological Sequences. J. Computational Biology, 1(4), 349-366.
Department of Biochemistry, Stanford University School of Medicine, Stanford California 94305-5307.
It is widely accepted that the optimal alignment between a pair of
proteins or nucleic acid sequences that minimizes the edit distance
may not necessarily reflect the correct biological (evolutionary or
structural) alignment. Alignments of proteins based on their
structures or of DNA sequences based on evolutionary changes are
often different from alignments that minimize edit distance. However,
in many cases (e.g. when the sequences are very similar), the edit
distance alignment is a good approximation to the biological one.
Since, for most sequences, the true alignment is unknown, a method
that either assesses the significance of the optimal alignment, or
that provides few "close" alternatives to the optimal one, is of
A near-optimal alignment is an alignment whose score lies within the neighborhood of the optimal score. Enumeration of near-optimal alignments (Waterman, 1983; Waterman and Byers, 1985) is not very practical since there are many such alignments. Other approaches (Vingron and Argos, 1989; Vingron and Argos, 1990; Zuker, 1991) that use only partial information about near-optimal alignments are more successful in practice.
We present a method for representing all alignments whose score is within any given delta from the optimal score. It represents a large number of alignments by a compact graph which makes it easy to impose additional biological constraints and select one desirable alignment from this large set. We study the combinatorial nature of near-optimal alignments, and define a set of "canonical" near-optimal alignments. We then show how to efficiently enumerate near-optimal alignments in order of their score, and count their numbers.
These graphic representations can show, in a dramatic way, all the possible alignments without the need to enumerate them. When applied to comparisons of two distantly related proteins, they reveal that the most highly conserved regions among all the near-optimal alignments are the highly structured regions (a-helices and b-strands) and the loops are the least highly conserved. This phenomena is best exemplified by the graphic representation of near-optimal alignments between the heavy and the light chain of immunoglobulins. We have demonstrated that the regions of maximum ambiguity among the near-optimal alignments between these two sequences are their hypervariable regions, or loops.
Although the number of near-optimal alignments grows exponentially, even when closely related sequences are compared, for each pair of sequences, a specific log-odds amino acid replacement matrix (PAM matrix) shows the fewest near-optimal alignments. This specific PAM matrix correlates with the matrix that gives the optimal score in the sequence alignment between two sequences. This observation may provide a correlation between parametric approaches to sequence alignment which optimize a score, and our graphic method which minimizes the number of near-optimal alignments.
Since alignments are essentially paths in a directed acyclic graph with (possibly negative) weights on its edges, our solution gives an extremely simple method to enumerate all K shortest (or longest) paths from s to t in such graphs in increasing order, as well as all (s, t) paths that are within d of the optimum, for any d. Our solution is as good, in terms of running time, as the previously known solutions to the K-best shortest paths in such graphs, but improves the space requirement by a factor of m when K>n. It also improves the time and space needed to count the number of near-optimal paths.
[Back to Doug] [Address] [Academics] [Honors] [Publications] [Presentations] [Public Service]