Viral genome prediction from raw human DNA sequence samples by combining natural language processing and machine learning techniques

Cited by: 5
Authors
Alshayeji, Mohammad H. [1 ]
Sindhu, Silpa ChandraBhasi [2 ]
Abed, Saed [1 ]
Affiliations
[1] Kuwait Univ, Coll Engn & Petr, Comp Engn Dept, POB 5969,Safat, Kuwait 13060, Kuwait
[2] Different Media, POB 14390, Faiha, Kuwait
Keywords
Metagenome; Machine learning; Human DNA; NLP; K-mer counting; Bag of;
DOI
10.1016/j.eswa.2023.119641
CLC number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Viral infection can cause a range of illnesses in humans, including cancer. When viruses infect a host, they may disrupt normal host function and cause deadly diseases. Understanding complex viral illnesses requires predicting novel viral genomes. Because many sequences in contigs assembled from human samples are not identical to known genomes, conventional alignment labels many assembled contigs as "unknown". In this study, sequences from 19 metagenomic investigations were used to build the proposed model; these sequences were examined and classified using BLAST. We implemented k-mer counting and the bag-of-words technique using CountVectorizer. To our knowledge, this work is the first framework that combines natural language processing (NLP) with traditional machine learning (ML) classification approaches on raw metagenomic contigs to automatically identify viruses in a variety of human biospecimens. The proposed models are general rather than specialized to a particular viral family. Because the methodology is precise and simple, it may be incorporated into computer-aided diagnosis (CAD) systems to ease day-to-day hospital activities. In the final stage, binary classification of deoxyribonucleic acid (DNA) sequences as normal or viral genomes was performed using traditional ML classifiers. Using the KNN classifier, the proposed model achieved 98.6% classification accuracy, 98.5% precision, 98.6% recall, an F1 score of 0.984, a Matthews correlation coefficient of 0.896, a kappa of 0.895, a classification success index of 0.97, and a detection rate of 98.6% for the prediction of viral genomes in DNA. Compared with previously developed ML techniques, the model achieved significantly better performance for viral genome prediction.
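The pipeline named in the abstract (k-mer counting, a bag-of-words representation, then a nearest-neighbor classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract names CountVectorizer, whereas the sketch below uses only the Python standard library to show the same idea. The function names, the value of k, the Euclidean distance, and the toy sequences and labels are all assumptions made for illustration.

```python
from collections import Counter
import math


def kmers(seq, k):
    """Split a DNA sequence into overlapping k-mers, treated as 'words'."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def vectorize(seq, vocab, k):
    """Count how often each vocabulary k-mer occurs in one sequence."""
    counts = Counter(kmers(seq, k))
    return [counts[word] for word in vocab]


def bag_of_words(seqs, k):
    """Build a sorted k-mer vocabulary and one count vector per sequence."""
    vocab = sorted({word for s in seqs for word in kmers(s, k)})
    return vocab, [vectorize(s, vocab, k) for s in seqs]


def knn_predict(train_vecs, train_labels, query, n_neighbors=1):
    """Majority vote among the n_neighbors closest training vectors."""
    ranked = sorted(
        range(len(train_vecs)),
        key=lambda i: math.dist(train_vecs[i], query),  # Euclidean distance
    )
    votes = Counter(train_labels[i] for i in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]


# Toy data: sequences and labels are invented for illustration only.
train_seqs = ["ATGATG", "GCGCGC"]
train_labels = ["viral", "human"]
vocab, train_vecs = bag_of_words(train_seqs, k=3)
prediction = knn_predict(train_vecs, train_labels, vectorize("ATGAT", vocab, 3))
print(prediction)
```

Query k-mers that are absent from the training vocabulary are simply ignored here, which mirrors how a fitted bag-of-words vocabulary treats unseen tokens.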
Pages: 10
Related papers
23 in total
[1]   Enhanced brain tumor classification using an optimized multi-layered convolutional neural network architecture [J].
Alshayeji, Mohammad ;
Al-Buloushi, Jassim ;
Ashkanani, Ali ;
Abed, Sa'ed .
MULTIMEDIA TOOLS AND APPLICATIONS, 2021, 80 (19) :28897-28917
[2]   MARVEL, a Tool for Prediction of Bacteriophage Sequences in Metagenomic Bins [J].
Amgarten, Deyvid ;
Braga, Lucas P. P. ;
da Silva, Aline M. ;
Setubal, Joao C. .
FRONTIERS IN GENETICS, 2018, 9
[3]   BLAST, Basic Local Alignment Search Tool, 2022
[4]   Machine Learning for detection of viral sequences in human metagenomic datasets [J].
Bzhalava, Zurab ;
Tampuu, Ardi ;
Bala, Piotr ;
Vicente, Raul ;
Dillner, Joakim .
BMC BIOINFORMATICS, 2018, 19
[5]   Extension of the viral ecology in humans using viral profile hidden Markov models [J].
Bzhalava, Zurab ;
Hultin, Emilie ;
Dillner, Joakim .
PLOS ONE, 2018, 13 (01)
[6]   16S Classifier: A Tool for Fast and Accurate Taxonomic Classification of 16S rRNA Hypervariable Regions in Metagenomic Datasets [J].
Chaudhary, Nikhil ;
Sharma, Ashok K. ;
Agarwal, Piyush ;
Gupta, Ankit ;
Sharma, Vineet K. .
PLOS ONE, 2015, 10 (02)
[7]   Explainable deep neural networks for novel viral genome prediction [J].
Dasari, Chandra Mohan ;
Bhukya, Raju .
APPLIED INTELLIGENCE, 2022, 52 (03) :3002-3017
[8]   Bag-of-Words Technique in Natural Language Processing: A Primer for Radiologists [J].
Juluru, Krishna ;
Shih, Hao-Hsin ;
Murthy, Krishna Nand Keshava ;
Elnajjar, Pierre .
RADIOGRAPHICS, 2021, 41 (05) :1420-1426
[9]   The human virome: assembly, composition and host interactions [J].
Liang, Guanxiang ;
Bushman, Frederic D. .
NATURE REVIEWS MICROBIOLOGY, 2021, 19 (08) :514-527
[10]   RNN-VirSeeker: A Deep Learning Method for Identification of Short Viral Sequences From Metagenomes [J].
Liu, Fu ;
Miao, Yan ;
Liu, Yun ;
Hou, Tao .
IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2022, 19 (03) :1840-1849