iPHoP: An integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria

Cited by: 150
Authors
Roux, Simon E. [1]
Camargo, Antonio Pedro [1]
Coutinho, Felipe H. [2]
Dabdoub, Shareef M. [3]
Dutilh, Bas E. [4, 5]
Nayfach, Stephen [1]
Tritt, Andrew [6]
Affiliations
[1] Lawrence Berkeley Natl Lab, DOE Joint Genome Inst, Berkeley, CA 94720 USA
[2] Inst Ciencias Mar ICM CSIC, Barcelona, Spain
[3] Univ Iowa, Div Biostat & Computat Biol, Coll Dent, Iowa City, IA USA
[4] Friedrich Schiller Univ, Inst Biodivers, Fac Biol Sci, Cluster Excellence Balance Microverse, Jena, Germany
[5] Univ Utrecht, Theoret Biol & Bioinformat, Sci Life, Utrecht, Netherlands
[6] Lawrence Berkeley Natl Lab, Computat Res Div, Berkeley, CA USA
Funding
European Research Council
Keywords
CRASSPHAGE; TRACKING; PHAGE; TOOLS;
DOI
10.1371/journal.pbio.3002083
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
The extraordinary diversity of viruses infecting bacteria and archaea is now primarily studied through metagenomics. While metagenomes enable high-throughput exploration of the viral sequence space, metagenome-derived sequences lack key information compared to isolated viruses, in particular host association. Different computational approaches are available to predict the host(s) of uncultivated viruses based on their genome sequences, but thus far individual approaches are limited either in precision or in recall, i.e., for a number of viruses they yield erroneous predictions or no prediction at all. Here, we describe iPHoP, a two-step framework that integrates multiple methods to reliably predict host taxonomy at the genus rank for a broad range of viruses infecting bacteria and archaea, while retaining a low false discovery rate. Based on a large dataset of metagenome-derived virus genomes from the IMG/VR database, we illustrate how iPHoP can provide extensive host prediction and guide further characterization of uncultivated viruses.
Pages: 26