PeNGaRoo, a combined gradient boosting and ensemble learning framework for predicting non-classical secreted proteins

Cited by: 41
Authors
Zhang, Yanju [1]
Yu, Sha [1, 2, 3]
Xie, Ruopeng [1, 2, 3]
Li, Jiahui [1, 2, 3]
Leier, Andre [4, 5]
Marquez-Lago, Tatiana T. [4, 5]
Akutsu, Tatsuya [6]
Smith, A. Ian [2, 3, 7]
Ge, Zongyuan [8, 9]
Wang, Jiawei [2, 3]
Lithgow, Trevor [2, 3]
Song, Jiangning [2, 3, 7]
Affiliations
[1] Guilin Univ Elect Technol, Sch Comp Sci & Informat Secur, Bioinformat Grp, Guilin 541004, Peoples R China
[2] Monash Univ, Biomed Discovery Inst, Infect & Immun Program, Melbourne, Vic 3800, Australia
[3] Monash Univ, Dept Biochem & Mol Biol, Melbourne, Vic 3800, Australia
[4] Univ Alabama Birmingham, Sch Med, Dept Genet, Birmingham, AL USA
[5] Univ Alabama Birmingham, Sch Med, Dept Cell Dev & Integrat Biol, Birmingham, AL USA
[6] Kyoto Univ, Bioinformat Ctr, Inst Chem Res, Uji, Kyoto 6110011, Japan
[7] Monash Univ, ARC Ctr Excellence Adv Mol Imaging, Melbourne, Vic 3800, Australia
[8] Monash Univ, Monash E Res Ctr, Melbourne, Vic 3800, Australia
[9] Monash Univ, Fac Engn, Melbourne, Vic 3800, Australia
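Funding
Australian Research Council; National Natural Science Foundation of China; National Institutes of Health (USA); Medical Research Council (UK);
Keywords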
WEB SERVER; INFORMATION; SEQUENCES; TRANSPORT; SYSTEMS; EXPORT;
DOI
10.1093/bioinformatics/btz629
CLC number
Q5 [Biochemistry];
Subject classification codes
071010; 081704;
Abstract
Motivation: Gram-positive bacteria have developed secretion systems to transport proteins across their cell wall, a process that plays an important role during host infection. These secretion mechanisms have also been harnessed for therapeutic purposes in many biotechnology applications. Accordingly, the identification of features that select a protein for efficient secretion from these microorganisms has become an important task. Among all the secreted proteins, 'non-classical' secreted proteins are difficult to identify as they lack discernible signal peptide sequences and can make use of diverse secretion pathways. Several computational methods have been developed to facilitate the discovery of such non-classical secreted proteins; however, the existing methods are based on either simulated or limited experimental datasets. In addition, they often employ basic features to train the models in a simple and coarse-grained manner. The availability of more experimentally validated datasets, advanced feature engineering techniques and novel machine learning approaches creates new opportunities for the development of improved predictors of 'non-classical' secreted proteins from sequence data.
Results: In this work, we first constructed a high-quality dataset of experimentally verified 'non-classical' secreted proteins, which we then used to create benchmark datasets. Using these benchmark datasets, we comprehensively analyzed a wide range of features and assessed their individual performance. Subsequently, we developed a two-layer Light Gradient Boosting Machine (LightGBM) ensemble model that integrates several single feature-based models into an overall prediction framework. At this stage, LightGBM, a gradient boosting machine, was used as the machine learning approach, and the necessary parameter optimization was performed by a particle swarm optimization strategy. All single feature-based LightGBM models were then integrated into a unified ensemble model to further improve the predictive performance. The final ensemble model achieved an accuracy of 0.900, an F-value of 0.903, a Matthews correlation coefficient of 0.803 and an area under the curve value of 0.963, outperforming previous state-of-the-art predictors on the independent test. Based on our proposed optimal ensemble model, we further developed an accessible online predictor, PeNGaRoo, to serve users' demands. We believe this online web server, together with our proposed methodology, will expedite the discovery of non-classically secreted effector proteins in Gram-positive bacteria and further inspire the development of next-generation predictors.
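For readers less familiar with the metrics quoted in the abstract, the following minimal sketch shows how accuracy, the F-value (assumed here to be the standard F1 score) and the Matthews correlation coefficient are conventionally computed from confusion-matrix counts. The function name `binary_metrics` and the counts in the usage line are hypothetical illustrations, not the paper's code or data; the area under the curve is omitted because it requires per-sample scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, F-value (F1) and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total

    # Precision and recall, guarding against empty denominators.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # F-value as the harmonic mean of precision and recall.
    pr_sum = precision + recall
    f_value = 2 * precision * recall / pr_sum if pr_sum else 0.0

    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"accuracy": accuracy, "f_value": f_value, "mcc": mcc}


# Hypothetical counts chosen for round numbers; NOT the paper's test set.
example = binary_metrics(tp=45, fp=5, tn=45, fn=5)
print(example)
```

With balanced classes and symmetric errors, as in these made-up counts, accuracy and the F-value coincide, while the MCC is lower because it also penalizes chance-level agreement.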
Pages: 704-712
Page count: 9