Utilizing genomic signatures to gain insights into the dynamics of SARS-CoV-2 through Machine and Deep Learning techniques

Cited by: 0
Authors
Ahmed M. A. Elsherbini
Amr Hassan Elkholy
Youssef M. Fadel
Gleb Goussarov
Ahmed Mohamed Elshal
Mohamed El-Hadidi
Mohamed Mysara
Affiliations
[1] Nile University, Bioinformatics Group, Center for Informatics Science, School of Information Technology and Computer Science
[2] Belgian Nuclear Research Centre (SCK•CEN), Microbiology Unit
Source
BMC Bioinformatics, vol. 25
Keywords
SARS-CoV-2; Genomic signature; Di nucleotide frequency; Tri nucleotide frequency; GenoSig; Deep Learning; Machine Learning; Random Forest
DOI
Not available
Abstract
The global spread of the SARS-CoV-2 pandemic, which originated in Wuhan, China, has had profound consequences for both health and the economy. Traditional alignment-based phylogenetic tree methods for tracking epidemic dynamics demand substantial computational power because the number of sequenced strains keeps growing. Consequently, there is a pressing need for an alignment-free approach to characterize these strains and monitor the dynamics of the various variants. In this work, we introduce a fast and simple tool named GenoSig, implemented in C++. The tool uses dinucleotide and trinucleotide frequency signatures to delineate the taxonomic lineages of SARS-CoV-2 with a range of machine learning (ML) and deep learning (DL) models. Our approach achieved a tenfold cross-validation accuracy of 87.88% (± 0.013) for the DL model and 86.37% (± 0.0009) for the Random Forest (RF) model, surpassing the other ML models. Validation on an additional unseen dataset yielded comparable results. Despite the architectural differences between DL and RF, both models performed better on the later clades, specifically GRA, GRY, and GK, than on the earlier clades G and GH. For the continental origin of the virus, both DL and RF performed worse than they did when predicting clades. However, both models achieved relatively higher accuracy for Europe, North America, and South America than for other continents, with DL outperforming RF. Both models consistently showed a preference for cytosine and guanine over adenine and thymine in the clade and continental analyses, for both the dinucleotide and trinucleotide frequency signatures. Our findings suggest that GenoSig provides a straightforward way to address taxonomic, epidemiological, and biological questions, using a reductive method applicable not only to SARS-CoV-2 but also to similar research questions in an alignment-free context.
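To make the idea of a dinucleotide and trinucleotide frequency signature concrete, the following is a minimal Python sketch of how such a feature vector could be computed from a single genome sequence. It is not the GenoSig implementation (which the abstract says is written in C++); the function names `kmer_frequencies` and `genomic_signature`, the use of overlapping windows, the normalization to relative frequencies, and the skipping of windows that contain ambiguous bases are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(sequence: str, k: int) -> dict[str, float]:
    """Relative frequency of every A/C/G/T k-mer in `sequence`.

    Assumptions (not taken from the paper): windows overlap, and any
    window containing a non-ACGT character (for example N) is skipped.
    """
    seq = sequence.upper().replace("U", "T")  # treat RNA input as DNA
    valid = set(ALPHABET)
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    )
    total = sum(counts.values())
    # Emit k-mers in a fixed lexicographic order so that every genome
    # yields a vector with the same layout.
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


def genomic_signature(sequence: str) -> list[float]:
    """Concatenate the 16 dinucleotide and 64 trinucleotide frequencies."""
    di = kmer_frequencies(sequence, 2)
    tri = kmer_frequencies(sequence, 3)
    return list(di.values()) + list(tri.values())
```

Under these assumptions each genome is reduced to an 80-value vector (16 dinucleotide plus 64 trinucleotide frequencies) without any alignment step, and vectors of this kind could then be passed to a classifier such as a random forest or a neural network.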