COMPARISON OF METHODS FOR SEARCHING PROTEIN-SEQUENCE DATABASES

被引:222
作者
PEARSON, WR
机构
[1] Department of Biochemistry, University of Virginia, Charlottesville, Virginia
关键词
BLAST; FASTA; PAM250; SEQUENCE SIMILARITY; SMITH-WATERMAN;
D O I
10.1002/pro.5560040613
中图分类号
Q5 [生物化学]; Q7 [分子生物学];
学科分类号
071010 ; 081704 ;
摘要
We have compared commonly used sequence comparison algorithms, scoring matrices, and gap penalties using a method that identifies statistically significant differences in performance. Search sensitivity with either the Smith-Waterman algorithm or FASTA is significantly improved by using modern scoring matrices, such as BLOSUM45-55, and optimized gap penalties instead of the conventional PAM250 matrix. More dramatic improvement can be obtained by scaling similarity scores by the logarithm of the length of the library sequence (ln()-scaling). With the best modern scoring matrix (BLOSUM55 or JO93) and optimal gap penalties (-12 for the first residue in the gap and -2 for additional residues), Smith-Waterman and FASTA performed significantly better than BLASTP. With ln()-scaling and optimal scoring matrices (BLOSUM45 or Gonnet92) and gap penalties (-12, -1), the rigorous Smith-Waterman algorithm performs better than either BLASTP and FASTA, although with the Gonnet92 matrix the difference with FASTA was not significant. Ln()-scaling performed better than normalization based on other simple functions of library sequence length. Ln()-scaling also performed better than scores based on normalized variance, but the differences were not statistically significant for the BLOSUM50 and Gonnet92 matrices. Optimal scoring matrices and gap penalties are reported for Smith-Waterman and FASTA, using conventional or ln()-scaled similarity scores. Searches with no penalty for gap extension, or no penalty for gap opening, or an infinite penalty for gaps performed significantly worse than the best methods. Differences in performance between PASTA and Smith-Waterman were not significant when partial query sequences were used. However, the best performance with complete query sequences was obtained with the Smith-Waterman algorithm and ln()-scaling.
引用
收藏
页码:1145 / 1160
页数:16
相关论文
共 30 条
[1]   AMINO-ACID SUBSTITUTION MATRICES FROM AN INFORMATION THEORETIC PERSPECTIVE [J].
ALTSCHUL, SF .
JOURNAL OF MOLECULAR BIOLOGY, 1991, 219 (03) :555-565
[2]   A PROTEIN ALIGNMENT SCORING SYSTEM SENSITIVE AT ALL EVOLUTIONARY DISTANCES [J].
ALTSCHUL, SF .
JOURNAL OF MOLECULAR EVOLUTION, 1993, 36 (03) :290-300
[3]   BASIC LOCAL ALIGNMENT SEARCH TOOL [J].
ALTSCHUL, SF ;
GISH, W ;
MILLER, W ;
MYERS, EW ;
LIPMAN, DJ .
JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) :403-410
[4]   AN EXTREME VALUE THEORY FOR SEQUENCE MATCHING [J].
ARRATIA, R ;
GORDON, L ;
WATERMAN, M .
ANNALS OF STATISTICS, 1986, 14 (03) :971-993
[5]   THE SWISS-PROT PROTEIN-SEQUENCE DATA-BANK [J].
BAIROCH, A ;
BOECKMANN, B .
NUCLEIC ACIDS RESEARCH, 1991, 19 :2247-2248
[6]   PROSITE - A DICTIONARY OF SITES AND PATTERNS IN PROTEINS [J].
BAIROCH, A .
NUCLEIC ACIDS RESEARCH, 1991, 19 :2241-2245
[7]  
BARKER WC, 1990, METHOD ENZYMOL, V183, P31
[8]   FLEXIBLE PROTEIN-SEQUENCE PATTERNS - A SENSITIVE METHOD TO DETECT WEAK STRUCTURAL SIMILARITIES [J].
BARTON, GJ ;
STERNBERG, MJE .
JOURNAL OF MOLECULAR BIOLOGY, 1990, 212 (02) :389-402
[9]  
CHAO KM, 1992, COMPUT APPL BIOSCI, V8, P481
[10]  
COLLINS JF, 1988, COMPUT APPL BIOSCI, V4, P67