Applying the Naive Bayes classifier with kernel density estimation to the prediction of protein-protein interaction sites

Cited by: 217

Authors:

Murakami, Yoichi [1]

Mizuguchi, Kenji [1]

Affiliations:

[1] Natl Inst Biomed Innovat, Osaka, Japan

Source:

BIOINFORMATICS | 2010, Vol. 26, Issue 15

Keywords:

BINDING-SITES; SOLVENT ACCESSIBILITY; SECONDARY STRUCTURE; SEQUENCE PROFILE; DATA-BANK; DATABASE; INFORMATION; INTERFACES; NETWORKS;

DOI:

10.1093/bioinformatics/btq302

Chinese Library Classification:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Motivation: The limited availability of protein structures often restricts the functional annotation of proteins and the identification of their protein-protein interaction sites. Computational methods to identify interaction sites from protein sequences alone are, therefore, required for unraveling the functions of many proteins. This article describes a new method (PSIVER) to predict interaction sites, i.e. residues binding to other proteins, in protein sequences. Only sequence features (position-specific scoring matrix and predicted accessibility) are used for training a Naive Bayes classifier (NBC), and conditional probabilities of each sequence feature are estimated using a kernel density estimation method (KDE).

Results: The leave-one-out cross-validation of PSIVER achieved a Matthews correlation coefficient (MCC) of 0.151, an F-measure of 35.3%, a precision of 30.6% and a recall of 41.6% on a non-redundant set of 186 protein sequences extracted from 105 heterodimers in the Protein Data Bank (consisting of 36 219 residues, of which 15.2% were known interface residues). Even though the dataset used for training was highly imbalanced, a randomization test demonstrated that the proposed method managed to avoid overfitting. PSIVER was also tested on 72 sequences not used in training (consisting of 18 140 residues, of which 10.6% were known interface residues), and achieved an MCC of 0.135, an F-measure of 31.5%, a precision of 25.0% and a recall of 46.5%, outperforming other publicly available servers tested on the same dataset. PSIVER enables experimental biologists to identify potential interface residues in unknown proteins from sequence information alone, and to mutate those residues selectively in order to unravel protein functions.
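The abstract names two ingredients, a Naive Bayes classifier whose per-feature conditional probabilities come from kernel density estimation, and evaluation by MCC and F-measure. The following is a minimal sketch of that general technique, not the PSIVER implementation: the Gaussian kernel, the fixed bandwidth of 0.5, the class and function names (`KDENaiveBayes`, `gaussian_kde_pdf`, `mcc`, `f_measure`) and the toy feature values are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def gaussian_kde_pdf(x: float, samples: Sequence[float], bandwidth: float) -> float:
    """Gaussian kernel density estimate of p(x) from 1-D training samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)


class KDENaiveBayes:
    """Naive Bayes with one KDE per (class, feature) pair. Hypothetical sketch."""

    def __init__(self, bandwidth: float = 0.5) -> None:
        self.bandwidth = bandwidth  # assumed fixed; not taken from the paper
        self.log_priors: Dict[int, float] = {}
        self.columns: Dict[int, List[List[float]]] = {}

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> "KDENaiveBayes":
        for label in set(y):
            rows = [row for row, lab in zip(X, y) if lab == label]
            self.log_priors[label] = math.log(len(rows) / len(y))
            # Store training values per feature so each feature gets its own KDE.
            self.columns[label] = [list(col) for col in zip(*rows)]
        return self

    def log_scores(self, x: Sequence[float]) -> Dict[int, float]:
        scores = {}
        for label, cols in self.columns.items():
            total = self.log_priors[label]
            for value, col in zip(x, cols):
                density = gaussian_kde_pdf(value, col, self.bandwidth)
                total += math.log(max(density, 1e-300))  # guard against log(0)
            scores[label] = total
        return scores

    def predict(self, x: Sequence[float]) -> int:
        scores = self.log_scores(x)
        return max(scores, key=scores.get)


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0


# Toy data (invented): each row is [profile score, predicted accessibility];
# label 1 = interface residue, label 0 = non-interface residue.
X = [[2.1, 0.8], [1.8, 0.6], [2.4, 0.7],
     [-1.0, 0.2], [-0.7, 0.1], [-1.3, 0.3], [-0.9, 0.25]]
y = [1, 1, 1, 0, 0, 0, 0]

model = KDENaiveBayes(bandwidth=0.5).fit(X, y)
print(model.predict([2.0, 0.7]), model.predict([-1.0, 0.2]))
print(round(f_measure(0.306, 0.416), 3))
```

As a consistency check, the harmonic mean of the reported cross-validation precision (30.6%) and recall (41.6%) comes to about 35.3%, which agrees with the F-measure quoted for that experiment.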

Citation

Pages: 1841-1848

Page count: 8
