SCORPION is a stacking-based ensemble learning framework for accurate prediction of phage virion proteins

Cited by: 32
Authors
Ahmad, Saeed [1]
Charoenkwan, Phasit [2]
Quinn, Julian M. W. [3]
Moni, Mohammad Ali [4]
Hasan, Md Mehedi [5]
Lio, Pietro [6]
Shoombuatong, Watshara [1]
Affiliations
[1] Mahidol Univ, Fac Med Technol, Ctr Data Min & Biomed Informat, Bangkok 10700, Thailand
[2] Chiang Mai Univ, Modern Management & Informat Technol, Coll Arts Media & Technol, Chiang Mai 50200, Thailand
[3] Garvan Inst Med Res, Bone Biol Div, 384 Victoria St, Darlinghurst, NSW 2010, Australia
[4] Univ Queensland, Fac Hlth & Behav Sci, Sch Hlth & Rehabil Sci, St Lucia, Qld 4072, Australia
[5] Tulane Univ, Tulane Ctr Biomed Informat & Genom, Sch Med, John W Deming Dept Med, Div Biomed Informat & Geno, New Orleans, LA 70112 USA
[6] Univ Cambridge, Dept Comp Sci & Technol, Cambridge CB3 0FD, England
Keywords
BACTERIOPHAGE VIRION; FEATURE-SELECTION; IDENTIFICATION; PEPTIDES; FEATURES
DOI
10.1038/s41598-022-08173-5
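Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]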
Discipline classification code
07; 0710; 09
Abstract
Fast and accurate identification of phage virion proteins (PVPs) would greatly facilitate antibacterial drug discovery and development. Although several machine learning (ML) methods have been developed for in silico identification of PVPs, these methods have certain limitations. Therefore, in this study, we propose a new computational approach, termed SCORPION (StaCking-based Predictor fOR Phage VIrion PrOteiNs), to accurately identify PVPs using only protein primary sequences. Specifically, we explored 13 different feature descriptors covering different aspects (i.e., compositional information, composition-transition-distribution information, position-specific information and physicochemical properties) in combination with 10 popular ML algorithms to construct a pool of optimal baseline models. The outputs of these baseline models were then used as probabilistic features (PFs) and combined into a new feature vector. Finally, we applied a two-step feature selection strategy to determine the optimal PF feature vector and used it to develop a stacked model (SCORPION). Both tenfold cross-validation and independent test results indicate that SCORPION achieves better predictive performance than its constituent baseline models and existing methods. We anticipate that SCORPION will serve as a useful tool for the cost-effective, large-scale screening of new PVPs. The source code and datasets for this work are available for download from the GitHub repository (https://github.com/saeed344/SCORPION).
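The stacking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn and NumPy, uses synthetic data instead of the SCORPION datasets, replaces the 13 descriptors and 10 algorithms with three hypothetical column slices (`desc_a`, `desc_b`, `desc_c`) and three common classifiers, and omits the two-step feature selection entirely. It only shows how baseline-model probabilities can be collected as probabilistic features (PFs) and passed to a meta-classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for sequence-derived features (assumption: NOT the SCORPION data).
X, y = make_classification(n_samples=300, n_features=40, n_informative=10, random_state=0)

# Hypothetical "descriptors": three column slices of the synthetic matrix.
descriptors = {"desc_a": slice(0, 15), "desc_b": slice(15, 30), "desc_c": slice(30, 40)}


def make_learners():
    """Return fresh, unfitted baseline learners (an illustrative choice of three)."""
    return {
        "rf": RandomForestClassifier(n_estimators=50, random_state=0),
        "lr": LogisticRegression(max_iter=1000),
        "knn": KNeighborsClassifier(n_neighbors=5),
    }


X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

pf_train, pf_test, names = [], [], []
for d_name, cols in descriptors.items():
    for l_name, model in make_learners().items():
        # Out-of-fold probabilities keep training labels from leaking into the PFs.
        oof = cross_val_predict(
            model, X_tr[:, cols], y_tr, cv=cv, method="predict_proba"
        )[:, 1]
        # Refit on the full training split to score the held-out test split.
        model.fit(X_tr[:, cols], y_tr)
        pf_train.append(oof)
        pf_test.append(model.predict_proba(X_te[:, cols])[:, 1])
        names.append(f"{d_name}+{l_name}")

# One PF column per (descriptor, learner) baseline model.
PF_train = np.column_stack(pf_train)
PF_test = np.column_stack(pf_test)

# Meta-classifier trained on the PF vector (the stacked model in this sketch).
meta = LogisticRegression(max_iter=1000).fit(PF_train, y_tr)
accuracy = meta.score(PF_test, y_te)
print(f"{len(names)} baseline models, stacked test accuracy = {accuracy:.3f}")
```

In this sketch each of the nine descriptor/learner pairs contributes one PF column. A feature selection step like the one the abstract mentions would operate on `PF_train` before the meta-classifier is fitted.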
Pages: 15