Applying the Naive Bayes classifier with kernel density estimation to the prediction of protein-protein interaction sites

Cited by: 217
Authors
Murakami, Yoichi [1]
Mizuguchi, Kenji [1]
Affiliations
[1] Natl Inst Biomed Innovat, Osaka, Japan
Keywords
BINDING-SITES; SOLVENT ACCESSIBILITY; SECONDARY STRUCTURE; SEQUENCE PROFILE; DATA-BANK; DATABASE; INFORMATION; INTERFACES; NETWORKS;
DOI
10.1093/bioinformatics/btq302
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
Motivation: The limited availability of protein structures often restricts the functional annotation of proteins and the identification of their protein-protein interaction sites. Computational methods to identify interaction sites from protein sequences alone are, therefore, required for unraveling the functions of many proteins. This article describes a new method (PSIVER) to predict interaction sites, i.e. residues binding to other proteins, in protein sequences. Only sequence features (position-specific scoring matrix and predicted accessibility) are used for training a Naive Bayes classifier (NBC), and conditional probabilities of each sequence feature are estimated using a kernel density estimation method (KDE).

Results: The leave-one-out cross-validation of PSIVER achieved a Matthews correlation coefficient (MCC) of 0.151, an F-measure of 35.3%, a precision of 30.6% and a recall of 41.6% on a non-redundant set of 186 protein sequences extracted from 105 heterodimers in the Protein Data Bank (consisting of 36 219 residues, of which 15.2% were known interface residues). Even though the dataset used for training was highly imbalanced, a randomization test demonstrated that the proposed method managed to avoid overfitting. PSIVER was also tested on 72 sequences not used in training (consisting of 18 140 residues, of which 10.6% were known interface residues), and achieved an MCC of 0.135, an F-measure of 31.5%, a precision of 25.0% and a recall of 46.5%, outperforming other publicly available servers tested on the same dataset. PSIVER enables experimental biologists to identify potential interface residues in unknown proteins from sequence information alone, and to mutate those residues selectively in order to unravel protein functions.
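
The general idea of a Naive Bayes classifier whose per-feature class-conditional densities come from kernel density estimation can be sketched as follows. This is a minimal illustration, not the PSIVER implementation: the Gaussian kernel, the rule-of-thumb bandwidth, the class name `KDENaiveBayes`, and the two toy features (a PSSM-derived score and a predicted accessibility value) are all assumptions made for the example, and the data are invented.

```python
import math
from statistics import stdev


def rule_of_thumb_bandwidth(samples):
    # Normal-reference rule of thumb (assumed here; the record does not
    # state which bandwidth selection the authors used).
    h = 1.06 * stdev(samples) * len(samples) ** (-0.2)
    return max(h, 1e-3)  # guard against a zero bandwidth


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the 1-D density at x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


class KDENaiveBayes:
    """Naive Bayes with one Gaussian KDE per (class, feature) pair."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.log_prior = {}
        self.kdes = {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(rows) / len(y))
            self.kdes[c] = []
            for j in range(len(X[0])):
                column = [r[j] for r in rows]
                h = rule_of_thumb_bandwidth(column)
                self.kdes[c].append(gaussian_kde(column, h))
        return self

    def log_joint(self, x):
        # log P(c) + sum_j log p(x_j | c), using the naive independence
        # assumption across features.
        scores = {}
        for c in self.classes:
            log_lik = sum(
                math.log(max(kde(v), 1e-300))  # floor avoids log(0)
                for kde, v in zip(self.kdes[c], x)
            )
            scores[c] = self.log_prior[c] + log_lik
        return scores

    def predict_proba(self, x, positive=1):
        # Normalise the joint scores into a posterior for `positive`.
        scores = self.log_joint(x)
        top = max(scores.values())
        total = sum(math.exp(s - top) for s in scores.values())
        return math.exp(scores[positive] - top) / total


# Toy residues: (hypothetical PSSM-derived score, predicted accessibility).
# Label 1 = interface residue, 0 = non-interface residue.
X = [
    (2.0, 0.80), (1.5, 0.70), (2.5, 0.90), (1.8, 0.75),
    (-1.0, 0.20), (-0.5, 0.30), (-1.5, 0.10),
    (0.0, 0.25), (-0.8, 0.15), (-1.2, 0.35),
]
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

model = KDENaiveBayes().fit(X, y)
p_interface_like = model.predict_proba((2.0, 0.80))
p_background_like = model.predict_proba((-1.0, 0.20))
print(p_interface_like, p_background_like)
```

Working in log space keeps the product of many small densities from underflowing, and the class prior term is what lets the classifier account for the imbalance between interface and non-interface residues.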
Pages: 1841-1848
Page count: 8