Detection of Outlier Residues for Improving Interface Prediction in Protein Heterocomplexes

Cited by: 19
Authors
Chen, Peng [1,2]
Wong, Limsoon [3]
Li, Jinyan [3,4]
Affiliations
[1] Chinese Acad Sci, Inst Intelligent Machines, Hefei 230031, Peoples R China
[2] Nanyang Technol Univ, Sch Comp Engn, Bioinformat Res Ctr, Singapore 639798, Singapore
[3] Natl Univ Singapore, Sch Comp, Singapore 117417, Singapore
[4] Univ Technol Sydney, Adv Analyt Inst, Broadway, NSW 2007, Australia
Funding
US National Science Foundation;
Keywords
Outlier detection; protein-protein interaction; SVM ensemble; INTERACTION SITES; SEQUENCE; IDENTIFICATION; PROFILE; COMPLEXES; IDENTIFY;
DOI
10.1109/TCBB.2012.58
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
Sequence-based understanding and identification of protein binding interfaces is a challenging research topic due to the complexity in protein systems and the imbalanced distribution between interface and noninterface residues. This paper presents an outlier detection idea to address the redundancy problem in protein interaction data. The cleaned training data are then used for improving the prediction performance. We use three novel measures to describe the extent a residue is considered as an outlier in comparison to the other residues: the distance of a residue instance from the center instance of all residue instances of the same class label (Dist), the probability of the class label of the residue instance (PCL), and the importance of within-class and between-class (IWB) residue instances. Outlier scores are computed by integrating the three factors; instances with a sufficiently large score are treated as outliers and removed. The data sets without outliers are taken as input for a support vector machine (SVM) ensemble. The proposed SVM ensemble trained on input data without outliers performs better than that with outliers. Our method is also more accurate than many literature methods on benchmark data sets. From our empirical studies, we found that some outlier interface residues are truly near to noninterface regions, and some outlier noninterface residues are close to interface regions.
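The distance factor (Dist) named in the abstract can be illustrated with a minimal sketch. It covers only the distance of each instance from the center of its own class; the PCL and IWB factors and the way the three are integrated are not specified in the abstract, so they are omitted here. The function names, the Euclidean metric, the labels (1 for interface, 0 for noninterface), and the cutoff rule (class mean plus `k` standard deviations) are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import defaultdict

def class_centroids(X, y):
    """Return {label: centroid}, the per-class mean of the feature vectors."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }

def dist_scores(X, y):
    """Euclidean distance of each instance from the centroid of its own class."""
    centroids = class_centroids(X, y)
    return [math.dist(row, centroids[label]) for row, label in zip(X, y)]

def remove_dist_outliers(X, y, k=1.5):
    """Drop instances whose distance score exceeds mean + k * std within
    their class (hypothetical cutoff rule, assumed for this sketch)."""
    scores = dist_scores(X, y)
    by_label = defaultdict(list)
    for score, label in zip(scores, y):
        by_label[label].append(score)

    cutoff = {}
    for label, vals in by_label.items():
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        cutoff[label] = mean + k * std

    kept = [i for i, (s, label) in enumerate(zip(scores, y)) if s <= cutoff[label]]
    return [X[i] for i in kept], [y[i] for i in kept], kept
```

The cleaned instances returned by `remove_dist_outliers` would then serve as training input for a classifier; the abstract uses an SVM ensemble for that step, which this sketch does not reproduce.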
Pages: 1155-1165
Number of pages: 11
Related papers
51 in total
[1]  
[Anonymous], 1999, KDD, DOI [10.1145/312129.312195, DOI 10.1016/J.EC0LENG.2010.11.031]
[2]   Dissecting subunit interfaces in homodimeric proteins [J].
Bahadur, RP ;
Chakrabarti, P ;
Rodier, F ;
Janin, J .
PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 2003, 53 (03) :708-719
[3]   A dissection of specific and non-specific protein - Protein interfaces [J].
Bahadur, RP ;
Chakrabarti, P ;
Rodier, F ;
Janin, J .
JOURNAL OF MOLECULAR BIOLOGY, 2004, 336 (04) :943-955
[4]  
BALDI P, 2000, BIOINFORMATICS MACHI
[5]  
Barnett V., 1994, Outliers in statistical data
[6]   The Protein Data Bank [J].
Berman, HM ;
Westbrook, J ;
Feng, Z ;
Gilliland, G ;
Bhat, TN ;
Weissig, H ;
Shindyalov, IN ;
Bourne, PE .
NUCLEIC ACIDS RESEARCH, 2000, 28 (01) :235-242
[7]   Improved prediction of protein-protein binding sites using a support vector machines approach [J].
Bradford, JR ;
Westhead, DR .
BIOINFORMATICS, 2005, 21 (08) :1487-1494
[8]   The use of the area under the roc curve in the evaluation of machine learning algorithms [J].
Bradley, AP .
PATTERN RECOGNITION, 1997, 30 (07) :1145-1159
[9]   Dissecting protein-protein recognition sites [J].
Chakrabarti, P ;
Janin, J .
PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 2002, 47 (03) :334-343
[10]   Anomaly Detection: A Survey [J].
Chandola, Varun ;
Banerjee, Arindam ;
Kumar, Vipin .
ACM COMPUTING SURVEYS, 2009, 41 (03)