Developing Computational Model to Predict Protein-Protein Interaction Sites Based on the XGBoost Algorithm

Cited by: 46

Authors:

Deng, Aijun [1,2,3]

Zhang, Huan [4]

Wang, Wenyan [4]

Zhang, Jun [5]

Fan, Dingdong [2]

Chen, Peng [5]

Wang, Bing [1,4,5]

Affiliations:

[1] Anhui Univ Technol, Minist Educ, Key Lab Met Emiss Reduct & Resources Recycling, Maanshan 243002, Peoples R China

[2] Anhui Univ Technol, Sch Met Engn, Maanshan 243032, Peoples R China

[3] Univ Leicester, Dept Engn, Leicester LE1 7RH, Leics, England

[4] Anhui Univ Technol, Sch Elect & Informat Engn, Maanshan 243032, Peoples R China

[5] Anhui Univ, Coinnovat Ctr Informat Supply & Assurance Technol, Hefei 230032, Peoples R China

Source:

INTERNATIONAL JOURNAL OF MOLECULAR SCIENCES | 2020 / Volume 21 / Issue 07

Funding:

National Natural Science Foundation of China;

Keywords:

protein interaction sites; unbalanced data sets; overlapping regions; XGBoost; PHOSPHORYLATION SITES; IDENTIFICATION; SEQUENCE; EVOLUTION; ENSEMBLE;

DOI:

10.3390/ijms21072274

Chinese Library Classification:

Q5 [Biochemistry]; Q7 [Molecular Biology];

Subject classification codes:

071010 ; 081704 ;

Abstract:

The study of protein-protein interactions is of great biological significance, and predicting protein-protein interaction sites can advance the understanding of cellular biological activity and aid drug development. However, the distribution of interaction and non-interaction sites is typically uneven because only a small number of protein interactions have been confirmed experimentally, which greatly affects the predictive capability of computational methods. In this work, two imbalanced-data processing strategies based on the XGBoost algorithm were proposed to re-balance the original dataset using the inherent relationship between positive and negative samples for the prediction of protein-protein interaction sites. A feature extraction method based on the evolutionary conservation of proteins was applied to represent the protein interaction sites, and the influence of overlapping regions of positive and negative samples on prediction performance was considered. Our method showed good prediction performance, with a prediction accuracy of 0.807 and an MCC of 0.614, on an original dataset with 10,455 surface residues but only 2,297 interface residues. Experimental results demonstrated the effectiveness of our XGBoost-based method.
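As a rough illustration of why the abstract reports MCC alongside accuracy, the sketch below computes the class imbalance from the residue counts quoted above and evaluates a trivial baseline that labels every residue as non-interface. It assumes the 2,297 interface residues are a subset of the 10,455 surface residues; the helper names `mcc` and `accuracy` are hypothetical, and none of this reproduces the authors' re-balancing strategies.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is taken as 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def accuracy(tp, tn, fp, fn):
    """Fraction of residues classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Counts quoted in the abstract. Assumption: interface residues are a
# subset of the surface residues, so the remainder are negatives.
n_surface = 10455
n_interface = 2297
n_non_interface = n_surface - n_interface

# Negatives per positive, a simple measure of the imbalance.
imbalance_ratio = n_non_interface / n_interface

# Trivial baseline: predict "non-interface" for every residue.
baseline_acc = accuracy(tp=0, tn=n_non_interface, fp=0, fn=n_interface)
baseline_mcc = mcc(tp=0, tn=n_non_interface, fp=0, fn=n_interface)

print(f"negatives per positive: {imbalance_ratio:.2f}")
print(f"baseline accuracy: {baseline_acc:.3f}, baseline MCC: {baseline_mcc:.3f}")
```

Under that assumption the all-negative baseline already reaches an accuracy of roughly 0.78 while its MCC is 0, which is why accuracy alone says little on a dataset this skewed.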

Pages: 13
