SAPPHIRE: A stacking-based ensemble learning framework for accurate prediction of thermophilic proteins

Cited by: 37
Authors
Charoenkwan, Phasit [1]
Schaduangrat, Nalini [2]
Moni, Mohammad Ali [3]
Lio, Pietro [4]
Manavalan, Balachandran [5]
Shoombuatong, Watshara [2]
Affiliations
[1] Chiang Mai Univ, Coll Arts Media & Technol, Modern Management & Informat Technol, Chiang Mai 50200, Thailand
[2] Mahidol Univ, Fac Med Technol, Ctr Data Min & Biomed Informat, Bangkok 10700, Thailand
[3] Univ Queensland, Fac Hlth & Behav Sci, Sch Hlth & Rehabil Sci, Artificial Intelligence & Digital Hlth Data Sci, St Lucia, Qld 4072, Australia
[4] Univ Cambridge, Dept Comp Sci & Technol, Cambridge CB3 0FD, England
[5] Sungkyunkwan Univ, Coll Biotechnol & Bioengn, Dept Integrat Biotechnol, Computat Biol & Bioinformat Lab, Suwon 16419, South Korea
Funding
National Research Foundation, Singapore;
Keywords
Thermophilic protein; Sequence analysis; Bioinformatics; Stacking strategy; Feature selection; Machine learning; AMINO-ACID-COMPOSITION; FEATURE-SELECTION; WEB SERVER; THERMOSTABILITY; DISCRIMINATION; INFORMATION; MUTATION; FEATURES;
DOI
10.1016/j.compbiomed.2022.105704
Chinese Library Classification (CLC) number
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
Thermophilic proteins (TPPs) are important in protein biochemistry and in the development of new enzymes. Computational methods that identify TPPs accurately and rapidly are therefore urgently needed. To date, several computational methods have been developed for TPP identification; however, some limitations in terms of performance and utility remain. In this study, we present a novel computational method, SAPPHIRE, to achieve more accurate identification of TPPs using only sequence information, without any need for structural information. We combined twelve different feature encodings representing different perspectives with six popular machine learning algorithms to train 72 baseline models and extract the key information of TPPs. Subsequently, the informative predicted probabilities from the baseline models were mined and selected using a genetic algorithm in conjunction with a self-assessment-report approach. Finally, the meta-predictor, SAPPHIRE, was built and optimized using the optimal feature set. In the 10-fold cross-validation test, SAPPHIRE showed superior predictive performance compared with several baseline models. Moreover, in the independent test, SAPPHIRE yielded an accuracy of 0.942 and a Matthews correlation coefficient of 0.884, which were 7.68% and 5.12% higher, respectively, than those of existing methods. The proposed computational approach is anticipated to facilitate large-scale identification of TPPs and accelerate their applications in the food industry. The codes and datasets are available at https://github.com/plenoi/SAPPHIRE.
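The stacking idea described in the abstract (baseline models built from encoding/algorithm pairs, whose predicted probabilities feed a meta-predictor) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Everything below is an assumption made for illustration: the abstract does not name the twelve encodings or six algorithms, so two common encodings (amino acid composition and dipeptide composition) and two scikit-learn classifiers stand in for them, the sequences are synthetic toy data, and the genetic-algorithm feature selection step is omitted.

```python
import random
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_predict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def aac(seq):
    """Amino acid composition: fraction of each of the 20 residues."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dpc(seq):
    """Dipeptide composition: fraction of each of the 400 adjacent pairs."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [pairs.count(d) / len(pairs) for d in DIPEPTIDES]


def toy_dataset(n_per_class=60, length=80, seed=0):
    """Synthetic sequences; class 1 is enriched in E, K, R (toy assumption)."""
    rng = random.Random(seed)
    weights = {
        0: [1.0] * len(AMINO_ACIDS),
        1: [3.0 if a in "EKR" else 1.0 for a in AMINO_ACIDS],
    }
    seqs, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            seqs.append("".join(rng.choices(AMINO_ACIDS, weights[label], k=length)))
            labels.append(label)
    return seqs, np.array(labels)


def build_meta_features(seqs, y, encodings, learners, cv):
    """Out-of-fold P(class 1) from every (encoding, learner) baseline model."""
    columns = []
    for encode in encodings.values():
        X = np.array([encode(s) for s in seqs])
        for make_learner in learners.values():
            proba = cross_val_predict(
                make_learner(), X, y, cv=cv, method="predict_proba"
            )
            columns.append(proba[:, 1])
    return np.column_stack(columns)


seqs, y = toy_dataset()
encodings = {"AAC": aac, "DPC": dpc}  # stand-ins for the twelve encodings
learners = {                          # stand-ins for the six algorithms
    "LR": lambda: LogisticRegression(max_iter=1000),
    "RF": lambda: RandomForestClassifier(n_estimators=50, random_state=0),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# 2 encodings x 2 learners = 4 baseline models (the paper reports 12 x 6 = 72).
meta_X = build_meta_features(seqs, y, encodings, learners, cv)

# Meta-predictor trained on the baseline probabilities, scored by cross-validation.
meta_pred = cross_val_predict(LogisticRegression(max_iter=1000), meta_X, y, cv=cv)
print("meta-feature matrix:", meta_X.shape)
print("accuracy:", accuracy_score(y, meta_pred))
print("MCC:", matthews_corrcoef(y, meta_pred))
```

Using out-of-fold probabilities as meta-features keeps each baseline model from scoring sequences it was trained on. In a fuller pipeline, a feature selection step such as the genetic algorithm mentioned in the abstract would choose a subset of the columns of `meta_X` before the meta-predictor is fitted.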
Pages: 9