Comprehensive benchmarking and ensemble approaches for metagenomic classifiers

Cited by: 206
Authors
McIntyre, Alexa B. R. [1, 2, 3]
Ounit, Rachid [4]
Afshinnekoo, Ebrahim [2, 3, 5]
Prill, Robert J. [6]
Henaff, Elizabeth [2, 3]
Alexander, Noah [2, 3]
Minot, Samuel S. [7]
Danko, David [1, 2, 3]
Foox, Jonathan [2, 3]
Ahsanuddin, Sofia [2, 3]
Tighe, Scott [8]
Hasan, Nur A. [9, 10]
Subramanian, Poorani [9]
Moffat, Kelly [9]
Levy, Shawn [11]
Lonardi, Stefano [4]
Greenfield, Nick [7]
Colwell, Rita R. [9, 12]
Rosen, Gail L. [13]
Mason, Christopher E. [2, 3, 14]
Affiliations
[1] Tri Inst Program Computat Biol & Med, New York, NY USA
[2] Weill Cornell Med, Dept Physiol & Biophys, New York, NY 10021 USA
[3] HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsau, New York, NY 10021 USA
[4] Univ Calif Riverside, Dept Comp Sci & Engn, Riverside, CA 92521 USA
[5] New York Med Coll, Sch Med, Valhalla, NY 10595 USA
[6] IBM Almaden Res Ctr, Accelerated Discovery Lab, San Jose, CA 95120 USA
[7] One Codex, Reference Genom, San Francisco, CA 94103 USA
[8] Univ Vermont, Burlington, VT 05405 USA
[9] CosmosID Inc, Rockville, MD 20850 USA
[10] Univ Maryland, Inst Adv Comp Studies UMIACS, Ctr Bioinformat & Computat Biol, College Pk, MD 20742 USA
[11] HudsonAlpha Inst Biotechnol, Huntsville, AL 35806 USA
[12] Johns Hopkins Univ, Bloomberg Sch Publ Hlth, Baltimore, MD USA
[13] Drexel Univ, Dept Elect & Comp Engn, Philadelphia, PA 19104 USA
[14] Feil Family Brain & Mind Res Inst, New York, NY 10065 USA
Funding
National Science Foundation (USA); National Institutes of Health (USA)
Keywords
Metagenomics; Shotgun sequencing; Taxonomy; Classification; Comparison; Ensemble methods; Metaclassification; Pathogen detection; TAXONOMIC CLASSIFICATION; GENERATION; SEQUENCE; COMMUNITIES; DIVERSITY; CATALOG; STRAIN;
DOI
10.1186/s13059-017-1299-7
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Discipline classification code
071005; 0836; 090102; 100705
Abstract
Background: One of the main challenges in metagenomics is the identification of microorganisms in clinical and environmental samples. While an extensive and heterogeneous set of computational tools is available to classify microorganisms using whole-genome shotgun sequencing data, comprehensive comparisons of these methods are limited.
Results: In this study, we use the largest-to-date set of laboratory-generated and simulated controls across 846 species to evaluate the performance of 11 metagenomic classifiers. Tools were characterized on the basis of their ability to identify taxa at the genus, species, and strain levels, quantify relative abundances of taxa, and classify individual reads to the species level. Strikingly, the number of species identified by the 11 tools can differ by over three orders of magnitude on the same datasets. Various strategies can ameliorate taxonomic misclassification, including abundance filtering, ensemble approaches, and tool intersection. Nevertheless, these strategies were often insufficient to completely eliminate false positives from environmental samples, which are especially important where they concern medically relevant species. Overall, pairing tools with different classification strategies (k-mer, alignment, marker) can combine their respective advantages.
Conclusions: This study provides positive and negative controls, titrated standards, and a guide for selecting tools for metagenomic analyses by comparing ranges of precision, accuracy, and recall. We show that proper experimental design and analysis parameters can reduce false positives, provide greater resolution of species in complex metagenomic samples, and improve the interpretation of results.
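The three mitigation strategies named in the abstract (abundance filtering, ensemble approaches, and tool intersection) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tool names, species profiles, the 1% abundance threshold, and the two-vote majority rule are all hypothetical values chosen only to show how each strategy trades precision against recall on a known control sample.

```python
from collections import Counter


def filter_by_abundance(profile, min_abundance):
    """Keep taxa whose relative abundance is at least min_abundance."""
    return {taxon for taxon, abundance in profile.items() if abundance >= min_abundance}


def vote_ensemble(calls_per_tool, min_votes):
    """Keep taxa reported by at least min_votes tools."""
    votes = Counter(taxon for calls in calls_per_tool for taxon in set(calls))
    return {taxon for taxon, n in votes.items() if n >= min_votes}


def precision_recall(predicted, truth):
    """Return (precision, recall) of a predicted taxon set against a known truth set."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical control sample: the species actually present.
truth = {"Escherichia coli", "Bacillus subtilis", "Staphylococcus aureus"}

# Hypothetical relative-abundance profiles from three tools that use
# different classification strategies (k-mer, alignment, marker).
profiles = {
    "kmer_tool": {
        "Escherichia coli": 0.50,
        "Bacillus subtilis": 0.30,
        "Staphylococcus aureus": 0.15,
        "Shigella flexneri": 0.05,
    },
    "alignment_tool": {
        "Escherichia coli": 0.55,
        "Bacillus subtilis": 0.35,
        "Bacillus cereus": 0.10,
    },
    "marker_tool": {
        "Escherichia coli": 0.60,
        "Staphylococcus aureus": 0.399,
        "Yersinia pestis": 0.001,
    },
}

# Abundance filtering: drop calls below an assumed 1% threshold.
filtered = {tool: filter_by_abundance(p, 0.01) for tool, p in profiles.items()}

# Tool intersection: keep only taxa that every tool reports.
intersection = set.intersection(*filtered.values())

# Ensemble by majority vote: keep taxa reported by at least two tools.
majority = vote_ensemble(filtered.values(), min_votes=2)

# Union of all calls, for comparison.
union = set.union(*filtered.values())

for name, calls in [("union", union), ("intersection", intersection), ("majority", majority)]:
    precision, recall = precision_recall(calls, truth)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy data the union keeps every false positive that survives filtering, the strict intersection removes them at the cost of recall, and the majority vote sits between the two. Real samples will not separate this cleanly, which is consistent with the abstract's caution that these strategies often fail to eliminate false positives entirely.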
Pages: 19