Naor, D. and Brutlag, D. L. (1994). On Near-Optimal Alignments of Biological Sequences. J. Computational Biology, 1(4), 349-366.

Department of Biochemistry, Stanford University School of Medicine, Stanford California 94305-5307.

It is widely accepted that the optimal alignment between a pair of
proteins or nucleic acid sequences that minimizes the edit distance
may not necessarily reflect the correct biological (evolutionary or
structural) alignment. Alignments of proteins based on their
structures or of DNA sequences based on evolutionary changes are
often different from alignments that minimize edit distance. However,
in many cases (e.g. when the sequences are very similar), the edit
distance alignment is a good approximation to the biological one.
Since, for most sequences, the true alignment is unknown, a method
that either assesses the significance of the optimal alignment, or
that provides few "close" alternatives to the optimal one, is of
great importance.

A near-optimal alignment is an alignment whose score lies within the
neighborhood of the optimal score. Enumeration of near-optimal
alignments (Waterman, 1983; Waterman and Byers, 1985) is not very
practical since there are many such alignments. Other approaches
(Vingron and Argos, 1989; Vingron and Argos, 1990; Zuker, 1991) that
use only partial information about near-optimal alignments are more
successful in practice.

We present a method for representing all alignments whose score is
within any given delta from the optimal score. It represents a large
number of alignments by a compact graph which makes it easy to impose
additional biological constraints and select one desirable alignment
from this large set. We study the combinatorial nature of
near-optimal alignments, and define a set of "canonical" near-optimal
alignments. We then show how to efficiently enumerate near-optimal
alignments in order of their score, and count their numbers.

These graphic representations can show, in a dramatic way, all the
possible alignments without the need to enumerate them. When applied
to comparisons of two distantly related proteins, they reveal that
the most highly conserved regions among all the near-optimal
alignments are the highly structured regions (a-helices and
b-strands) and the loops are the least highly conserved. This
phenomena is best exemplified by the graphic representation of
near-optimal alignments between the heavy and the light chain of
immunoglobulins. We have demonstrated that the regions of maximum
ambiguity among the near-optimal alignments between these two
sequences are their hypervariable regions, or loops.

Although the number of near-optimal alignments grows exponentially,
even when closely related sequences are compared, for each pair of
sequences, a specific log-odds amino acid replacement matrix (PAM
matrix) shows the fewest near-optimal alignments. This specific PAM
matrix correlates with the matrix that gives the optimal score in the
sequence alignment between two sequences. This observation may
provide a correlation between parametric approaches to sequence
alignment which optimize a score, and our graphic method which
minimizes the number of near-optimal alignments.

Since alignments are essentially paths in a directed acyclic graph
with (possibly negative) weights on its edges, our solution gives an
extremely simple method to enumerate all K shortest (or longest)
paths from s to t in such graphs in increasing order, as well as all
(s, t) paths that are within d of the optimum, for any d. Our
solution is as good, in terms of running time, as the previously
known solutions to the K-best shortest paths in such graphs, but
improves the space requirement by a factor of m when K>n. It also
improves the time and space needed to count the number of
near-optimal paths.

[Back to Doug] [Address] [Academics] [Honors] [Publications] [Presentations] [Public Service]