Accurately identifying hemagglutinin using sequence information and machine learning methods

Cited by: 60
Authors
Zou, Xidan [1]
Ren, Liping [2]
Cai, Peiling [3]
Zhang, Yang [4]
Ding, Hui [1]
Deng, Kejun [1]
Yu, Xiaolong [5]
Lin, Hao [1]
Huang, Chengbing [6]
Affiliations
[1] Univ Elect Sci & Technol China, Ctr Informat Biol, Sch Life Sci & Technol, Chengdu, Peoples R China
[2] Chengdu Neusoft Univ, Sch Healthcare Technol, Chengdu, Peoples R China
[3] Chengdu Univ, Sch Basic Med Sci, Chengdu, Peoples R China
[4] Chengdu Univ Tradit Chinese Med, Innovat Inst Chinese Med & Pharm, Acad Interdiscipline, Chengdu, Peoples R China
[5] Hainan Univ, Sch Mat Sci & Engn, Haikou, Peoples R China
[6] Aba Teachers Univ, Sch Comp Sci & Technol, Aba, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation
Keywords
hemagglutinin; machine learning; sequence features; feature extraction; stacking; BINDING; PREDICTION; PROTEIN; TOOL;
DOI
10.3389/fmed.2023.1281880
Chinese Library Classification number
R5 [Internal Medicine]
Subject classification code
1002; 100201
Abstract
Introduction: Hemagglutinin (HA) facilitates viral entry and infection by promoting fusion between the host membrane and the virus. Given its significance in influenza virus infection, HA has garnered attention as a target for influenza drug and vaccine development. Accurately identifying HA is therefore crucial for developing targeted vaccines and drugs. However, in-silico methods for identifying HA are still lacking. This study aims to design a computational model to identify HA.
Methods: A benchmark dataset comprising 106 HA and 106 non-HA sequences was obtained from UniProt. Various sequence-based features were used to formulate samples. By performing feature optimization and feeding the optimized features into four kinds of machine learning methods, we constructed an integrated classifier model using the stacking algorithm.
Results and discussion: The model achieved an accuracy of 95.85% and an area under the receiver operating characteristic (ROC) curve of 0.9863 in 5-fold cross-validation. In the independent test, the model achieved an accuracy of 93.18% and an area under the ROC curve of 0.9793. The code is available at https://github.com/Zouxidan/HA_predict.git. The proposed model shows excellent prediction performance and should be a convenient tool for biochemistry researchers studying HA.
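The abstract does not name the sequence-based features that were used. As a rough illustration of what such a feature can look like, the sketch below computes amino acid composition (the fraction of each of the 20 standard residues in a sequence), a common encoding that is assumed here for the example and is not confirmed as one of the authors' features. The function name `amino_acid_composition` and the toy peptide are hypothetical.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids, in a fixed order so every vector is aligned.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> List[float]:
    """Return the fraction of each standard amino acid in `sequence`.

    Non-standard symbols (e.g. X, B, Z) are ignored; this is an
    illustrative choice, not something stated in the abstract.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Made-up toy peptide for demonstration; it is not a real HA sequence.
toy = "MKTIIALSYI"
vector = amino_acid_composition(toy)
```

Each sequence becomes a fixed-length 20-dimensional vector regardless of its length, which is what allows sequences of different sizes to be passed to standard classifiers.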
Pages: 9