A discriminative method for family-based protein remote homology detection that combines inductive logic programming and propositional models

Cited by: 6
Authors
Bernardes, Juliana S. [1, 2]
Carbone, Alessandra [2, 3]
Zaverucha, Gerson
Affiliations
[1] Univ Fed Rio de Janeiro, COPPE, Programa Engn Sistemas & Comp, BR-21945 Rio De Janeiro, Brazil
[2] Univ Paris 06, UMR Genom Analyt 7238, F-75006 Paris, France
[3] CNRS, Lab Genom Microorganismes, UMR7238, F-75006 Paris, France
Keywords
HIDDEN MARKOV-MODELS; STRING KERNELS; DATABASE; TOOL;
DOI
10.1186/1471-2105-12-83
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
Background: Remote homology detection is a hard computational problem. Most approaches train computational models on either full protein sequences or multiple sequence alignments (MSA), including all positions. However, for proteins in the "twilight zone", only some segments of the sequences (motifs) are conserved. We introduce a novel logical representation that captures the physicochemical properties of sequences, conserved amino acid positions, and conserved physicochemical positions in the MSA. From this representation, Inductive Logic Programming (ILP) finds the most frequent patterns (motifs), which are then used to train propositional models such as decision trees and support vector machines (SVM).
Results: We use the SCOP database for our experiments, evaluating protein recognition within the same superfamily. With SVM, our methodology performs significantly better than some state-of-the-art methods and comparably to others. Our method also provides a comprehensible set of logical rules that can help to understand what determines a protein's function.
Conclusions: Selecting only the most frequent patterns is an effective strategy for remote homology detection. It is made possible by a suitable first-order logical representation of homologous properties and by a set of frequent patterns, found by an ILP system, that summarizes essential features of protein functions.
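The pipeline summarized in the abstract (mine frequent patterns from an alignment, then use them as features for a propositional learner) can be illustrated with a minimal sketch. This is not the authors' ILP system: it replaces first-order rule search with a brute-force enumeration of small conjunctions of per-position facts, and the amino acid grouping in `PROPERTY`, the function names, the support threshold, and the toy sequences are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical coarse physicochemical grouping (an assumption for this sketch;
# the record does not give the paper's actual property definitions).
PROPERTY = {}
for group, residues in {
    "hydrophobic": "AVLIMFWC",
    "polar": "STNQYGP",
    "positive": "KRH",
    "negative": "DE",
}.items():
    for residue in residues:
        PROPERTY[residue] = group


def literals(aligned_seq):
    """Return the (position, kind, value) facts that hold for one aligned sequence."""
    facts = set()
    for pos, residue in enumerate(aligned_seq):
        if residue == "-":  # alignment gap: no fact at this position
            continue
        facts.add((pos, "residue", residue))
        facts.add((pos, "property", PROPERTY[residue]))
    return facts


def frequent_patterns(family, min_support=0.75, max_len=2):
    """Keep conjunctions of up to max_len facts that cover >= min_support of the family."""
    fact_sets = [literals(seq) for seq in family]
    candidates = sorted(set().union(*fact_sets))
    patterns = []
    for k in range(1, max_len + 1):
        for combo in combinations(candidates, k):
            covered = sum(set(combo) <= facts for facts in fact_sets)
            if covered / len(fact_sets) >= min_support:
                patterns.append(combo)
    return patterns


def propositionalize(seqs, patterns):
    """Turn each sequence into a 0/1 vector: one column per frequent pattern."""
    rows = []
    for seq in seqs:
        facts = literals(seq)
        rows.append([int(set(p) <= facts) for p in patterns])
    return rows


# Toy aligned sequences (invented for illustration only).
family = ["AKDL", "VKDL", "IRDM", "LK-L"]
others = ["GDKS", "SEKT"]

patterns = frequent_patterns(family)
X_family = propositionalize(family, patterns)
X_others = propositionalize(others, patterns)
print(len(patterns), "frequent patterns used as binary features")
```

The resulting 0/1 matrices are the kind of propositional input that a decision tree or SVM could be trained on; each column remains readable as a rule such as "position 0 is hydrophobic", which is the interpretability benefit the abstract points to.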
Pages: 13