Viral genome prediction from raw human DNA sequence samples by combining natural language processing and machine learning techniques

Cited by: 5
Authors
Alshayeji, Mohammad H. [1]
Sindhu, Silpa ChandraBhasi [2]
Abed, Saed [1]
Affiliations
[1] Kuwait Univ, Coll Engn & Petr, Comp Engn Dept, POB 5969, Safat, Kuwait 13060, Kuwait
[2] Different Media, POB 14390, Faiha, Kuwait
Keywords
Metagenome; Machine learning; Human DNA; NLP; K-mer counting; Bag of;
DOI
10.1016/j.eswa.2023.119641
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Infection with a virus can lead to a range of illnesses in humans, including cancer. When viruses infect a host, they may disrupt normal host function and cause deadly diseases. Understanding complicated viral illnesses requires novel viral genome prediction. Because many of the sequences in contigs assembled from human samples are not identical to known genomes, conventional alignments label many assembled contigs as "unknown". In this study, sequences from 19 metagenomic investigations were used to create the proposed model, and these sequences were examined and classified using BLAST. We implemented k-mer counting and the bag-of-words technique using CountVectorizer. As far as we are aware, this work represents the first framework that combines natural language processing (NLP) with traditional machine learning (ML) classification approaches on raw metagenomic contigs to automatically identify viruses in a variety of human biospecimens. The suggested models are general rather than specialized to a particular viral family. Because the proposed methodology is precise and simple, it may be incorporated into computer-aided diagnosis (CAD) systems to make day-to-day hospital activities easier. In the last stage, binary classification of deoxyribonucleic acid (DNA) into normal and viral genomes was performed using traditional ML classifiers. Using the KNN classifier, the suggested model achieved 98.6% classification accuracy along with 98.5% precision, 98.6% recall, 0.984 F1 score, 0.896 Matthews correlation coefficient, 0.895 kappa, 0.97 classification success index and a detection rate of 98.6% for the prediction of viral genomes in DNA. Compared to previously developed ML techniques, the model achieved significantly better performance for viral genome prediction.
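The pipeline named in the abstract (k-mer counting, a bag-of-words representation, then a KNN classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract states that CountVectorizer was used, whereas the sketch below uses only the Python standard library as a stand-in so that it runs without extra dependencies. The function names, the choice of k = 3, the number of neighbors, and the toy sequences and labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter
import math


def kmer_tokens(seq, k):
    """Split a DNA sequence into overlapping k-mers, treated as 'words'."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocabulary(sequences, k):
    """Map every k-mer seen in the training sequences to a column index."""
    vocab = sorted({tok for s in sequences for tok in kmer_tokens(s, k)})
    return {tok: idx for idx, tok in enumerate(vocab)}


def vectorize(seq, vocab, k):
    """Bag-of-words count vector; k-mers outside the vocabulary are ignored."""
    vec = [0] * len(vocab)
    for tok, n in Counter(kmer_tokens(seq, k)).items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec


def knn_predict(train_vecs, train_labels, query_vec, n_neighbors):
    """Majority vote among the nearest training vectors (Euclidean distance)."""
    ranked = sorted(
        (math.dist(v, query_vec), label)
        for v, label in zip(train_vecs, train_labels)
    )
    votes = Counter(label for _, label in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]


# Hypothetical toy data, NOT from the study: made-up sequences and labels.
K = 3
train_seqs = [
    "ATGCATGCATGCATGC",
    "ATGCATGCATGGATGC",
    "GGTTCCAAGGTTCCAA",
    "GGTTCCAAGGTTCGAA",
]
train_labels = ["non-viral", "non-viral", "viral", "viral"]

vocab = build_vocabulary(train_seqs, K)
train_vecs = [vectorize(s, vocab, K) for s in train_seqs]

query = "GGTTCCAAGGTTCCAT"
prediction = knn_predict(train_vecs, train_labels, vectorize(query, vocab, K), 3)
print(prediction)
```

In scikit-learn, a comparable bag-of-k-mers matrix is commonly obtained by joining the k-mers of each sequence with spaces and passing the resulting strings to CountVectorizer; whether the authors did exactly that, and which k and neighbor count they used, is not stated in the abstract.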
Pages: 10