[Back to Doug] [Address]
[Academics] [Honors]
[Publications] [Presentations]
[Public Service]

Wu, T. D., Nevill-Manning, C. G. and Brutlag, D. L. (2000). Fast
probabilistic analysis of sequence function using scoring matrices.
Bioinformatics, 16, 1-12.

## Fast probabilistic analysis of sequence function using scoring matrices

#### T. D. Wu, C. G. Nevill-Manning & D. L. Brutlag

Department of Biochemistry, Stanford University School of Medicine,
Stanford, California 94305-5307.

Motivation: We present techniques for increasing the speed and
accuracy of sequence analysis using scoring matrices.

Results: The speed of sequence analysis is increased by three
techniques: significance filtering, lookahead scoring, and permuted
lookahead scoring. These techniques require that the user specify a
significance threshold, which indicates the relative sensitivity or
specificity desired for a particular sequence analysis. In
significance filtering, the program reports only those segments that
exceed a score threshold corresponding to the given significance
threshold. In lookahead scoring, the program can terminate the scoring
of a segment early by comparing intermediate scores with intermediate
score thresholds. Permuted lookahead scoring scores the positions of
each segment in an order that maximizes the likelihood of terminating
early. Both lookahead scoring techniques substantially reduce the
number of residues that must be examined. The fraction of residues
examined ranges from 62% to 6%, depending on the significance
threshold chosen by the user.

Accuracy is increased by calculating the statistical significance of
hits under a Markov assumption. We develop a method for computing the
probability mass function and the quantile function for a scoring
matrix using Markov frequencies. We observe that Markov assumptions
tend to raise the p-value of segments, when compared with the
independence assumption, by an average ratio of 1.30 for a first-order
Markov assumption and 1.69 for a second-order Markov assumption.

The techniques described above are implemented in a system called
ematrix, and a set of scoring matrices has been precompiled into a
database called recognize. ematrix is several times faster than
existing programs. At a significance threshold of 10^-6, ematrix
reaches speeds of 225 residues/second. At a significance threshold of
10^-20, its speed rises to 541 residues/second.
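
The lookahead idea in the abstract can be illustrated with a minimal Python sketch. It is not the ematrix implementation. The intermediate thresholds here are derived in one natural way (the final score threshold minus the best score still attainable from the unscored positions), and the permuted order uses a simple heuristic (widest score range first). Both choices are assumptions for illustration, since the abstract does not state them. The four-letter alphabet, `TOY_MATRIX`, and all function names are likewise hypothetical.

```python
# Toy scoring matrix: one dict of residue scores per position (hypothetical values).
TOY_MATRIX = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": 0, "C": 1, "G": 0, "T": 0},
    {"A": -3, "C": -3, "G": 4, "T": -3},
]


def intermediate_thresholds(matrix, order, score_threshold):
    """Minimum partial score needed after each scored position (in `order`)
    for the final score to still be able to reach `score_threshold`."""
    max_per_col = [max(matrix[j].values()) for j in order]
    remaining = sum(max_per_col)
    thresholds = []
    for best in max_per_col:
        remaining -= best  # best score still attainable from unscored positions
        thresholds.append(score_threshold - remaining)
    return thresholds


def permuted_order(matrix):
    """Assumed heuristic: score the positions with the widest score range first."""
    def spread(j):
        return max(matrix[j].values()) - min(matrix[j].values())
    return sorted(range(len(matrix)), key=spread, reverse=True)


def lookahead_score(segment, matrix, order, thresholds):
    """Return (score, residues_examined); score is None if scoring stopped early."""
    partial = 0
    for k, j in enumerate(order):
        partial += matrix[j][segment[j]]
        if partial < thresholds[k]:
            return None, k + 1
    return partial, len(order)


def scan(sequence, matrix, score_threshold, permute=False):
    """Slide the matrix along `sequence`; report hits and total residues examined."""
    width = len(matrix)
    order = permuted_order(matrix) if permute else list(range(width))
    thresholds = intermediate_thresholds(matrix, order, score_threshold)
    hits, examined = [], 0
    for start in range(len(sequence) - width + 1):
        segment = sequence[start:start + width]
        score, n = lookahead_score(segment, matrix, order, thresholds)
        examined += n
        if score is not None:
            hits.append((start, score))
    return hits, examined


if __name__ == "__main__":
    print(scan("ACGTAAGC", TOY_MATRIX, 6))
    print(scan("ACGTAAGC", TOY_MATRIX, 6, permute=True))
```

On this toy input both calls report the same two hits, but the permuted order examines 10 residues instead of 12 (out of 18 without lookahead), because the most discriminating position is checked first.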
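
The step from a significance threshold to a score threshold can be sketched in the same spirit. The code below computes the probability mass function of matrix scores for random segments generated by a first-order Markov chain, then reads off a quantile. It is a sketch under stated assumptions, not necessarily the formulation used in the paper: the dynamic program over (last residue, partial score) states, the toy matrix, the uniform frequencies, and the function names are all hypothetical.

```python
from collections import defaultdict

# Toy scoring matrix and background model (hypothetical values).
TOY_MATRIX = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": 0, "C": 1, "G": 0, "T": 0},
    {"A": -3, "C": -3, "G": 4, "T": -3},
]
ALPHABET = "ACGT"
UNIFORM_INITIAL = {a: 0.25 for a in ALPHABET}
# A uniform transition table makes the chain equivalent to independence.
UNIFORM_TRANSITION = {a: {b: 0.25 for b in ALPHABET} for a in ALPHABET}


def score_pmf_markov(matrix, initial, transition):
    """PMF of the total score of a random segment drawn from a first-order
    Markov chain with the given initial and transition probabilities."""
    # State: (last residue, partial score) -> probability.
    dist = defaultdict(float)
    for a, p in initial.items():
        dist[(a, matrix[0][a])] += p
    for column in matrix[1:]:
        nxt = defaultdict(float)
        for (a, s), p in dist.items():
            for b, t in transition[a].items():
                nxt[(b, s + column[b])] += p * t
        dist = nxt
    # Marginalize out the last residue.
    pmf = defaultdict(float)
    for (_, s), p in dist.items():
        pmf[s] += p
    return dict(pmf)


def score_threshold(pmf, significance):
    """Smallest score T with P(score >= T) <= significance, or None if even
    the maximum score is more probable than `significance`."""
    tail, best = 0.0, None
    for s in sorted(pmf, reverse=True):
        tail += pmf[s]
        if tail > significance:
            break
        best = s
    return best


if __name__ == "__main__":
    pmf = score_pmf_markov(TOY_MATRIX, UNIFORM_INITIAL, UNIFORM_TRANSITION)
    print(sorted(pmf.items(), reverse=True))
    print(score_threshold(pmf, 0.07))
```

With the uniform model, a significance threshold of 0.07 maps to a score threshold of 6 for this toy matrix. Replacing `UNIFORM_TRANSITION` with estimated transition frequencies changes the tail probabilities, which is how a Markov assumption can shift the p-value of a segment relative to the independence assumption.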
