Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework

Cited by: 70
Authors
Zhang, Yanju [1]
Xie, Ruopeng [2, 3, 4]
Wang, Jiawei [2, 4]
Leier, Andre [5, 6]
Marquez-Lago, Tatiana T. [7, 8]
Akutsu, Tatsuya [9]
Webb, Geoffrey I. [10, 11]
Chou, Kuo-Chen [12]
Song, Jiangning [4, 11, 13]
Affiliations
[1] Guilin Univ Elect Technol, Sch Comp Sci & Informat Secur, Bioinformat Grp, Guilin 541004, Peoples R China
[2] Monash Univ, Dept Microbiol, Melbourne, Vic 3800, Australia
[3] Guilin Univ Elect Technol, Sch Comp Sci & Informat Secur, Guilin, Peoples R China
[4] Monash Univ, Biomed Discovery Inst, Melbourne, Vic, Australia
[5] Univ Alabama Birmingham, Sch Med, Dept Genet, Birmingham, AL USA
[6] UAB Comprehens Canc Ctr, Birmingham, AL USA
[7] UAB, Sch Med, Dept Genet, Birmingham, AL USA
[8] UAB, Sch Med, Dept Cell Dev & Integrat Biol, Birmingham, AL USA
[9] Kyoto Univ, Bioinformat Ctr, Inst Chem Res, Kyoto, Japan
[10] Monash Univ, Fac Informat Technol, Melbourne, Vic, Australia
[11] Monash Univ, Monash Ctr Data Sci, Melbourne, Vic, Australia
[12] Gordon Life Sci Inst, Boston, MA USA
[13] Monash Univ, Dept Biochem & Mol Biol, Melbourne, Vic, Australia
Funding
National Institutes of Health (USA); Australian Research Council
Keywords
lysine malonylation; computational prediction; feature encoding methods; machine learning; ensemble learning; Light Gradient Boosting Machine; SUBCELLULAR LOCATION; ACCURATE PREDICTION; SCORING MATRIX; WEB SERVER; PSSM; PROTEINS; SEQUENCE; REPRESENTATION; SUCCINYLATION; RESIDUES;
DOI
10.1093/bib/bby079
Chinese Library Classification
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
As a newly discovered post-translational modification (PTM), lysine malonylation (Kmal) regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of different features and machine learning (ML) methods that are required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of Kmal sites. We identify optimized feature sets, with which four commonly used ML methods (random forest, support vector machines, K-nearest neighbor and logistic regression) and one recently proposed [Light Gradient Boosting Machine (LightGBM)] are trained on data from three species, namely, Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integration of the single method-based models through ensemble learning further improves the prediction performance and model robustness on the independent test. When compared to the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu/. 
We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.
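The evaluation workflow summarized in the abstract (randomized 10-fold cross-validation, integration of single-method models through ensemble learning, comparison by AUC) can be illustrated with a minimal sketch. This is not the authors' implementation: the score-averaging rule, the function names and the toy scores below are all assumptions made for illustration, and the abstract does not state how the ensemble combines its base models.

```python
import random

def random_k_folds(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def average_ensemble(score_lists):
    """Combine base models by averaging their per-site scores.

    Assumption: simple averaging; the abstract does not specify the rule.
    """
    n_models = len(score_lists)
    return [sum(column) / n_models for column in zip(*score_lists)]

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 1 = malonylated lysine, 0 = non-malonylated lysine.
labels = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.4, 0.7, 0.5, 0.2, 0.1]  # made-up scores from one base model
model_b = [0.6, 0.8, 0.3, 0.2, 0.6, 0.3]  # made-up scores from another
ensemble = average_ensemble([model_a, model_b])

folds = random_k_folds(n_samples=25, k=10, seed=42)

if __name__ == "__main__":
    print("AUC model A :", round(auc(labels, model_a), 3))
    print("AUC model B :", round(auc(labels, model_b), 3))
    print("AUC ensemble:", round(auc(labels, ensemble), 3))
    print("fold sizes  :", [len(f) for f in folds])
```

In this toy case the two base models each misrank different sites, so averaging their scores ranks every positive above every negative. Real data will rarely behave this cleanly, and the gain from an ensemble depends on how complementary the base models are.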
Pages: 2185-2199
Number of pages: 15