Identifying viruses from metagenomic data using deep learning

Cited by: 348
Authors
Ren, Jie [1]
Song, Kai [2]
Deng, Chao [1]
Ahlgren, Nathan A. [3]
Fuhrman, Jed A. [4]
Li, Yi [5]
Xie, Xiaohui [5]
Poplin, Ryan [6]
Sun, Fengzhu [1]
Affiliations
[1] Univ Southern Calif, Quantitat & Computat Biol Program, Los Angeles, CA 90089 USA
[2] Qingdao Univ, Sch Math & Stat, Qingdao 266071, Peoples R China
[3] Clark Univ, Dept Biol, Worcester, MA 01610 USA
[4] Univ Southern Calif, Dept Biol Sci, Los Angeles, CA 90089 USA
[5] Univ Calif Irvine, Dept Comp Sci, Irvine, CA 92697 USA
[6] Google Inc, Mountain View, CA 94043 USA
Funding
National Natural Science Foundation of China; US National Institutes of Health; US National Science Foundation
Keywords
metagenome; deep learning; virus identification; machine learning; codon usage; sequence; virome; DNA; alignment; genome; classification; prediction
DOI
10.1007/s40484-019-0187-4
Chinese Library Classification code
Q [Biological Sciences]
Discipline classification code
07; 0710; 09
Abstract
Background: The recent development of metagenomic sequencing makes it possible to massively sequence microbial genomes, including viral genomes, without the need for laboratory culture. Existing reference-based and gene homology-based methods are not efficient in identifying unknown viruses or short viral sequences from metagenomic data.

Methods: Here we developed a reference-free and alignment-free machine learning method, DeepVirFinder, for identifying viral sequences in metagenomic data using deep learning.

Results: Trained on sequences from viral RefSeq discovered before May 2015, and evaluated on those discovered after that date, DeepVirFinder outperformed the state-of-the-art method VirFinder at all contig lengths, achieving AUROC 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences, respectively. Enlarging the training data with millions of additional purified viral sequences from metavirome samples further improved the accuracy for identifying under-represented virus groups. Applying DeepVirFinder to real human gut metagenomic samples, we identified 51,138 viral sequences belonging to 175 bins in patients with colorectal carcinoma (CRC). Ten bins were found to be associated with cancer status, suggesting that viruses may play important roles in CRC.

Conclusions: Powered by deep learning and high-throughput sequencing metagenomic data, DeepVirFinder significantly improved the accuracy of viral identification and will assist the study of viruses in the era of metagenomics.
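
As a reading aid for the AUROC values quoted in the abstract, the following is a minimal sketch of how AUROC can be computed from classifier scores. It is not taken from the paper or from DeepVirFinder's code: the function name `auroc` and the toy labels and scores are hypothetical, and the rank-based (Mann-Whitney) formulation is one standard way to compute the metric. AUROC here is the probability that a randomly chosen viral sequence receives a higher score than a randomly chosen non-viral sequence.

```python
def auroc(labels, scores):
    """Rank-based AUROC (hypothetical helper, not from the paper).

    labels: iterable of 0/1, where 1 marks a positive (e.g. viral) sequence.
    scores: iterable of floats, higher meaning "more likely positive".
    Tied scores receive the average of their ranks, so a tie between a
    positive and a negative counts as half a correct ordering.
    """
    pairs = sorted(zip(scores, labels))  # ascending by score
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of equal scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")

    # Mann-Whitney U statistic for the positives, normalised to [0, 1].
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


if __name__ == "__main__":
    # Toy data (illustrative only): two positives, two negatives.
    toy_labels = [1, 0, 1, 0]
    toy_scores = [0.9, 0.8, 0.3, 0.1]
    print(auroc(toy_labels, toy_scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the reported range of 0.93 to 0.98.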
Pages: 64-77
Number of pages: 14