Accurately identifying hemagglutinin using sequence information and machine learning methods

Times cited: 60
Authors
Zou, Xidan [1]
Ren, Liping [2]
Cai, Peiling [3]
Zhang, Yang [4]
Ding, Hui [1]
Deng, Kejun [1]
Yu, Xiaolong [5]
Lin, Hao [1]
Huang, Chengbing [6]
Affiliations
[1] Univ Elect Sci & Technol China, Ctr Informat Biol, Sch Life Sci & Technol, Chengdu, Peoples R China
[2] Chengdu Neusoft Univ, Sch Healthcare Technol, Chengdu, Peoples R China
[3] Chengdu Univ, Sch Basic Med Sci, Chengdu, Peoples R China
[4] Chengdu Univ Tradit Chinese Med, Innovat Inst Chinese Med & Pharm, Acad Interdiscipline, Chengdu, Peoples R China
[5] Hainan Univ, Sch Mat Sci & Engn, Haikou, Peoples R China
[6] Aba Teachers Univ, Sch Comp Sci & Technol, Aba, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation
Keywords
hemagglutinin; machine learning; sequence features; feature extraction; stacking; BINDING; PREDICTION; PROTEIN; TOOL;
DOI
10.3389/fmed.2023.1281880
Chinese Library Classification (CLC) number
R5 [Internal Medicine]
Discipline classification code
1002; 100201
Abstract
Introduction: Hemagglutinin (HA) facilitates viral entry and infection by promoting fusion between the host membrane and the virus. Given its significance in influenza virus infection, HA has garnered attention as a target for influenza drug and vaccine development. Accurately identifying HA is therefore crucial for the development of targeted vaccines and drugs. However, in-silico methods for identifying HA are still lacking. This study aims to design a computational model to identify HA.
Methods: A benchmark dataset comprising 106 HA and 106 non-HA sequences was obtained from UniProt. Various sequence-based features were used to formulate the samples. After feature optimization, the features were fed into four kinds of machine learning methods, and an integrated classifier model was constructed using the stacking algorithm.
Results and discussion: The model achieved an accuracy of 95.85% and an area under the receiver operating characteristic (ROC) curve of 0.9863 in 5-fold cross-validation. In the independent test, the model achieved an accuracy of 93.18% and an area under the ROC curve of 0.9793. The code is available at https://github.com/Zouxidan/HA_predict.git. The proposed model shows strong prediction performance and should be a convenient tool for researchers studying HA.
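The abstract does not say which sequence-based features were used. As a minimal sketch of what such a feature can look like, the Python snippet below computes amino acid composition (AAC), a common 20-dimensional descriptor. The choice of AAC, the function name `amino_acid_composition`, and the toy sequence fragment are assumptions made for illustration and are not taken from the paper or its repository.

```python
from __future__ import annotations

from collections import Counter

# The 20 standard amino acids, in a fixed order so every feature
# vector has the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> list[float]:
    """Return the fraction of each standard residue in a protein sequence.

    Hypothetical helper for illustration; non-standard characters
    (for example X or gaps) are ignored.
    """
    counts = Counter(res for res in sequence.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy 16-residue fragment, used only to exercise the function.
features = amino_acid_composition("MKTIIALSYIFCLVFA")
```

Each sequence is mapped to a fixed-length vector whose entries sum to 1, which is the form a classifier such as the stacked model described above would take as input.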
Pages: 9