
Wu, T. D., Nevill-Manning, C. G. and Brutlag, D. L. (2000). Fast probabilistic analysis of sequence function using scoring matrices. Bioinformatics, 16, 1-12.

Fast probabilistic analysis of sequence function using scoring matrices

T. D. Wu, C. G. Nevill-Manning & D. L. Brutlag

Department of Biochemistry, Stanford University School of Medicine, Stanford, California 94305-5307.

Motivation: We present techniques for increasing the speed and accuracy of sequence analysis using scoring matrices.

Results: The speed of sequence analysis is increased by three techniques: significance filtering, lookahead scoring, and permuted lookahead scoring. These techniques require that the user specify a significance threshold, which indicates the relative sensitivity or specificity desired for a particular sequence analysis. In significance filtering, the program reports only those segments that exceed a score threshold corresponding to the given significance threshold. In lookahead scoring, the program can terminate the scoring of a segment early by comparing intermediate scores with intermediate score thresholds. Permuted lookahead scoring scores the positions of each segment in an order that maximizes the likelihood of terminating scoring early. Both lookahead scoring techniques substantially reduce the number of residues that must be examined. The fraction of residues examined ranges from 62% to 6%, depending on the significance threshold chosen by the user. Accuracy is increased by calculating the statistical significance of hits under a Markov assumption. We develop a method for computing the probability mass function and the quantile function for a scoring matrix using Markov frequencies. We observe that Markov assumptions tend to raise the p value of segments, when compared with the independence assumption, by an average ratio of 1.30 for a first-order Markov assumption and 1.69 for a second-order Markov assumption. The techniques described above are implemented in a system called ematrix, and a set of scoring matrices has been precompiled into a database called recognize. ematrix is several times faster than existing programs. At a significance threshold of 10^-6, ematrix reaches speeds of 225 residues/second. At a significance threshold of 10^-20, its speed rises to 541 residues/second.
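The early-termination idea behind lookahead scoring can be illustrated with a minimal Python sketch. It is not the ematrix implementation: the matrix representation, the function names, and the toy matrix are all assumptions made for illustration. In particular, `spread_order` is a hypothetical heuristic (visit the columns with the widest score range first) standing in for the ordering criterion of permuted lookahead scoring, which the abstract does not specify.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Assumed representation: one {residue: score} column per matrix position.
Matrix = List[Dict[str, float]]


def intermediate_thresholds(matrix: Matrix, threshold: float,
                            order: Sequence[int]) -> List[float]:
    """Minimum partial score needed after each step to still reach threshold."""
    # max_rest[k] = best total obtainable from the columns visited at step k or later
    max_rest = [0.0] * (len(order) + 1)
    for k in range(len(order) - 1, -1, -1):
        max_rest[k] = max_rest[k + 1] + max(matrix[order[k]].values())
    return [threshold - max_rest[k + 1] for k in range(len(order))]


def lookahead_score(matrix: Matrix, segment: str, threshold: float,
                    order: Optional[Sequence[int]] = None
                    ) -> Tuple[Optional[float], int]:
    """Return (score, residues examined); score is None if the segment was rejected."""
    positions = list(range(len(matrix))) if order is None else list(order)
    cutoffs = intermediate_thresholds(matrix, threshold, positions)
    partial = 0.0
    for step, pos in enumerate(positions):
        partial += matrix[pos][segment[pos]]
        if partial < cutoffs[step]:
            # Even the best remaining residues cannot reach the threshold.
            return None, step + 1
    return partial, len(positions)


def spread_order(matrix: Matrix) -> List[int]:
    """Hypothetical ordering heuristic: widest score range first (an assumption)."""
    return sorted(range(len(matrix)),
                  key=lambda i: -(max(matrix[i].values()) - min(matrix[i].values())))


# Toy matrix over the alphabet {A, C}; the third column is the most discriminating.
toy = [{"A": 1.0, "C": 0.0}, {"A": 0.0, "C": 1.0}, {"A": 5.0, "C": -5.0}]

plain = lookahead_score(toy, "ACC", threshold=6.0)
permuted = lookahead_score(toy, "ACC", threshold=6.0, order=spread_order(toy))
hit = lookahead_score(toy, "ACA", threshold=6.0, order=spread_order(toy))
```

In this toy case, left-to-right scoring must examine all three residues of "ACC" before rejecting it, whereas the permuted order rejects it after a single residue. A segment that does reach the threshold, such as "ACA", is fully scored under either order.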

