PeNGaRoo, a combined gradient boosting and ensemble learning framework for predicting non-classical secreted proteins

Cited by: 41
Authors
Zhang, Yanju [1]
Yu, Sha [1, 2, 3]
Xie, Ruopeng [1, 2, 3]
Li, Jiahui [1, 2, 3]
Leier, Andre [4, 5]
Marquez-Lago, Tatiana T. [4, 5]
Akutsu, Tatsuya [6]
Smith, A. Ian [2, 3, 7]
Ge, Zongyuan [8, 9]
Wang, Jiawei [2, 3]
Lithgow, Trevor [2, 3]
Song, Jiangning [2, 3, 7]
Affiliations
[1] Guilin Univ Elect Technol, Sch Comp Sci & Informat Secur, Bioinformat Grp, Guilin 541004, Peoples R China
[2] Monash Univ, Biomed Discovery Inst, Infect & Immun Program, Melbourne, Vic 3800, Australia
[3] Monash Univ, Dept Biochem & Mol Biol, Melbourne, Vic 3800, Australia
[4] Univ Alabama Birmingham, Sch Med, Dept Genet, Birmingham, AL USA
[5] Univ Alabama Birmingham, Sch Med, Dept Cell Dev & Integrat Biol, Birmingham, AL USA
[6] Kyoto Univ, Bioinformat Ctr, Inst Chem Res, Uji, Kyoto 6110011, Japan
[7] Monash Univ, ARC Ctr Excellence Adv Mol Imaging, Melbourne, Vic 3800, Australia
[8] Monash Univ, Monash E Res Ctr, Melbourne, Vic 3800, Australia
[9] Monash Univ, Fac Engn, Melbourne, Vic 3800, Australia
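Funding
Australian Research Council; National Natural Science Foundation of China; US National Institutes of Health; UK Medical Research Council;
Keywords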
WEB SERVER; INFORMATION; SEQUENCES; TRANSPORT; SYSTEMS; EXPORT;
DOI
10.1093/bioinformatics/btz629
Chinese Library Classification
Q5 [Biochemistry];
Subject classification code
071010 ; 081704 ;
Abstract
Motivation: Gram-positive bacteria have developed secretion systems to transport proteins across their cell wall, a process that plays an important role during host infection. These secretion mechanisms have also been harnessed for therapeutic purposes in many biotechnology applications. Accordingly, the identification of features that select a protein for efficient secretion from these microorganisms has become an important task. Among all the secreted proteins, 'non-classical' secreted proteins are difficult to identify as they lack discernable signal peptide sequences and can make use of diverse secretion pathways. Several computational methods have been developed to facilitate the discovery of such non-classical secreted proteins; however, the existing methods are based on either simulated or limited experimental datasets. In addition, they often employ basic features to train the models in a simple and coarse-grained manner. The availability of more experimentally validated datasets, advanced feature engineering techniques and novel machine learning approaches creates new opportunities for the development of improved predictors of 'non-classical' secreted proteins from sequence data.
Results: In this work, we first constructed a high-quality dataset of experimentally verified 'non-classical' secreted proteins, which we then used to create benchmark datasets. Using these benchmark datasets, we comprehensively analyzed a wide range of features and assessed their individual performance. Subsequently, we developed a two-layer Light Gradient Boosting Machine (LightGBM) ensemble model that integrates several single feature-based models into an overall prediction framework. At this stage, LightGBM, a gradient boosting machine, was used as the machine learning approach and the necessary parameter optimization was performed by a particle swarm optimization strategy. All single feature-based LightGBM models were then integrated into a unified ensemble model to further improve the predictive performance. Consequently, the final ensemble model achieved superior performance, with an accuracy of 0.900, an F-value of 0.903, a Matthews correlation coefficient of 0.803 and an area under the curve value of 0.963, outperforming previous state-of-the-art predictors on the independent test. Based on our proposed optimal ensemble model, we further developed an accessible online predictor, PeNGaRoo, to serve users' demands. We believe this online web server, together with our proposed methodology, will expedite the discovery of non-classically secreted effector proteins in Gram-positive bacteria and further inspire the development of next-generation predictors.
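For readers unfamiliar with the four measures reported in the abstract, the following is a minimal Python sketch of how accuracy, F-value, Matthews correlation coefficient and area under the curve are conventionally computed for a binary classifier. It is an illustration only and not the authors' evaluation code: the function names `binary_metrics` and `auc_from_scores`, the confusion-matrix counts and the scores are all hypothetical, and the F-value is assumed here to be the usual F1 score.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, F-value (assumed to be F1) and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_value = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    # MCC is defined as 0 when any marginal count is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f_value": f_value, "mcc": mcc}


def auc_from_scores(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This is quadratic in the number of
    samples, which is acceptable for a small illustration.
    """
    correct = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))


# Hypothetical counts and scores, not taken from the paper.
metrics = binary_metrics(tp=45, fp=5, tn=40, fn=10)
auc = auc_from_scores([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
print(metrics, auc)
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, so it stays informative when the two classes are imbalanced.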
Pages: 704-712
Number of pages: 9