Developing Computational Model to Predict Protein-Protein Interaction Sites Based on the XGBoost Algorithm

Cited by: 46
Authors
Deng, Aijun [1, 2, 3]
Zhang, Huan [4]
Wang, Wenyan [4]
Zhang, Jun [5]
Fan, Dingdong [2]
Chen, Peng [5]
Wang, Bing [1, 4, 5]
Affiliations
[1] Anhui Univ Technol, Minist Educ, Key Lab Met Emiss Reduct & Resources Recycling, Maanshan 243002, Peoples R China
[2] Anhui Univ Technol, Sch Met Engn, Maanshan 243032, Peoples R China
[3] Univ Leicester, Dept Engn, Leicester LE1 7RH, Leics, England
[4] Anhui Univ Technol, Sch Elect & Informat Engn, Maanshan 243032, Peoples R China
[5] Anhui Univ, Coinnovat Ctr Informat Supply & Assurance Technol, Hefei 230032, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
protein interaction sites; unbalanced data sets; overlapping regions; XGBoost; PHOSPHORYLATION SITES; IDENTIFICATION; SEQUENCE; EVOLUTION; ENSEMBLE;
DOI
10.3390/ijms21072274
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification codes
071010; 081704;
Abstract
The study of protein-protein interactions is of great biological significance, and predicting protein-protein interaction sites can improve the understanding of cellular activity and support drug development. However, interaction and non-interaction sites are typically unevenly distributed, because only a small number of protein interactions have been confirmed experimentally, and this imbalance greatly affects the predictive capability of computational methods. In this work, two imbalanced-data processing strategies based on the XGBoost algorithm are proposed; they re-balance the original dataset by exploiting the inherent relationship between positive and negative samples. A feature extraction method based on the evolutionary conservation of proteins is used to represent the interaction sites, and the influence of overlapping regions between positive and negative samples on prediction performance is considered. The method achieved a prediction accuracy of 0.807 and an MCC of 0.614 on an original dataset with 10,455 surface residues but only 2,297 interface residues. These experimental results demonstrate the effectiveness of the XGBoost-based method.
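The two metrics quoted in the abstract, accuracy and the Matthews correlation coefficient (MCC), can be illustrated with a minimal Python sketch. The confusion-matrix counts below are hypothetical and are not taken from the paper; the function names are likewise illustrative. The imbalance ratio assumes that the 2,297 interface residues are a subset of the 10,455 surface residues, which the abstract suggests but does not state explicitly.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all residues that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: MCC is taken as 0 when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
tp, tn, fp, fn = 80, 80, 20, 20
acc_value = accuracy(tp, tn, fp, fn)
mcc_value = mcc(tp, tn, fp, fn)

# Class imbalance implied by the abstract's dataset sizes, ASSUMING
# interface residues are counted among the surface residues.
surface_residues = 10455
interface_residues = 2297
non_interface_residues = surface_residues - interface_residues
imbalance_ratio = non_interface_residues / interface_residues

print(f"accuracy = {acc_value:.3f}, MCC = {mcc_value:.3f}")
print(f"negatives per positive = {imbalance_ratio:.2f}")
```

Under that assumption there are roughly 3.5 non-interface residues for every interface residue. Because accuracy alone can look favourable on data this imbalanced, MCC, which uses all four confusion-matrix cells, is a useful companion metric.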
Pages: 13