Remote homology detection: a motif based approach

Cited by: 109
Authors
Ben-Hur, Asa [1]
Brutlag, Douglas [1]
Affiliations
[1] Stanford Univ, Dept Biochem, Beckman Ctr B400, Stanford, CA 94305 USA
Keywords
remote homology; discrete sequence motifs; sequence similarity; Support Vector Machines; kernel methods
DOI
10.1093/bioinformatics/btg1002
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Motivation: Remote homology detection is the problem of detecting homology in cases of low sequence similarity. It is a hard computational problem with no approach that works well in all cases.

Results: We present a method for detecting remote homology that is based on the presence of discrete sequence motifs. The motif content of a pair of sequences is used to define a similarity that is used as a kernel for a Support Vector Machine (SVM) classifier. We test the method on two remote homology detection tasks: prediction of a previously unseen SCOP family and prediction of an enzyme class given other enzymes that have a similar function on other substrates. We find that it performs significantly better than an SVM method that uses BLAST or Smith-Waterman similarity scores as features.
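The idea of turning shared motif content into an SVM kernel can be illustrated with a minimal sketch. It is not the authors' implementation: the three regular-expression motifs, the function names, and the choice of a plain dot product over motif counts (with an optional cosine normalization) are all assumptions made for illustration, and the abstract does not specify any of them.

```python
import re
from math import sqrt

# Hypothetical toy motifs written as regular expressions.
# They are illustrative only and are not taken from the paper.
MOTIFS = [r"C..C", r"G[KR]S", r"H.H"]


def motif_counts(seq, motifs=MOTIFS):
    """Return one count per motif: how many times it occurs in seq."""
    # A zero-width lookahead lets overlapping occurrences be counted.
    return [len(re.findall(f"(?={m})", seq)) for m in motifs]


def motif_kernel(x, y, motifs=MOTIFS):
    """Assumed similarity: dot product of the two motif-count vectors."""
    fx, fy = motif_counts(x, motifs), motif_counts(y, motifs)
    return sum(a * b for a, b in zip(fx, fy))


def normalized_kernel(x, y, motifs=MOTIFS):
    """Cosine-normalized variant; 0.0 if either sequence has no motif."""
    kxx = motif_kernel(x, x, motifs)
    kyy = motif_kernel(y, y, motifs)
    if kxx == 0 or kyy == 0:
        return 0.0
    return motif_kernel(x, y, motifs) / sqrt(kxx * kyy)


def gram_matrix(seqs, motifs=MOTIFS):
    """Pairwise kernel values for a list of sequences."""
    return [[motif_kernel(a, b, motifs) for b in seqs] for a in seqs]
```

Because the kernel is a dot product of explicit feature vectors, the resulting Gram matrix is symmetric and positive semidefinite, so it could be handed to any SVM implementation that accepts a precomputed kernel (for example, scikit-learn's `SVC(kernel="precomputed")`).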
Pages: i26-i33
Page count: 8