Comprehensive benchmarking and ensemble approaches for metagenomic classifiers

Cited by: 198
Authors
McIntyre, Alexa B. R. [1, 2, 3]
Ounit, Rachid [4]
Afshinnekoo, Ebrahim [2, 3, 5]
Prill, Robert J. [6]
Henaff, Elizabeth [2, 3]
Alexander, Noah [2, 3]
Minot, Samuel S. [7]
Danko, David [1, 2, 3]
Foox, Jonathan [2, 3]
Ahsanuddin, Sofia [2, 3]
Tighe, Scott [8]
Hasan, Nur A. [9, 10]
Subramanian, Poorani [9]
Moffat, Kelly [9]
Levy, Shawn [11]
Lonardi, Stefano [4]
Greenfield, Nick [7]
Colwell, Rita R. [9, 12]
Rosen, Gail L. [13]
Mason, Christopher E. [2, 3, 14]
Affiliations
[1] Tri Inst Program Computat Biol & Med, New York, NY USA
[2] Weill Cornell Med, Dept Physiol & Biophys, New York, NY 10021 USA
[3] HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsau, New York, NY 10021 USA
[4] Univ Calif Riverside, Dept Comp Sci & Engn, Riverside, CA 92521 USA
[5] New York Med Coll, Sch Med, Valhalla, NY 10595 USA
[6] IBM Almaden Res Ctr, Accelerated Discovery Lab, San Jose, CA 95120 USA
[7] One Codex, Reference Genom, San Francisco, CA 94103 USA
[8] Univ Vermont, Burlington, VT 05405 USA
[9] CosmosID Inc, Rockville, MD 20850 USA
[10] Univ Maryland, Inst Adv Comp Studies UMIACS, Ctr Bioinformat & Computat Biol, College Pk, MD 20742 USA
[11] HudsonAlpha Inst Biotechnol, Huntsville, AL 35806 USA
[12] Johns Hopkins Univ, Bloomberg Sch Publ Hlth, Baltimore, MD USA
[13] Drexel Univ, Dept Elect & Comp Engn, Philadelphia, PA 19104 USA
[14] Feil Family Brain & Mind Res Inst, New York, NY 10065 USA
Source
GENOME BIOLOGY | 2017 / Vol. 18
Funding
US National Science Foundation; US National Institutes of Health;
Keywords
Metagenomics; Shotgun sequencing; Taxonomy; Classification; Comparison; Ensemble methods; Metaclassification; Pathogen detection; TAXONOMIC CLASSIFICATION; GENERATION; SEQUENCE; COMMUNITIES; DIVERSITY; CATALOG; STRAIN;
DOI
10.1186/s13059-017-1299-7
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Discipline classification code
071005; 0836; 090102; 100705;
Abstract
Background: One of the main challenges in metagenomics is the identification of microorganisms in clinical and environmental samples. While an extensive and heterogeneous set of computational tools is available to classify microorganisms using whole-genome shotgun sequencing data, comprehensive comparisons of these methods are limited.
Results: In this study, we use the largest-to-date set of laboratory-generated and simulated controls across 846 species to evaluate the performance of 11 metagenomic classifiers. Tools were characterized on the basis of their ability to identify taxa at the genus, species, and strain levels, quantify relative abundances of taxa, and classify individual reads to the species level. Strikingly, the number of species identified by the 11 tools can differ by over three orders of magnitude on the same datasets. Various strategies can ameliorate taxonomic misclassification, including abundance filtering, ensemble approaches, and tool intersection. Nevertheless, these strategies were often insufficient to completely eliminate false positives from environmental samples, which are especially important where they concern medically relevant species. Overall, pairing tools with different classification strategies (k-mer, alignment, marker) can combine their respective advantages.
Conclusions: This study provides positive and negative controls, titrated standards, and a guide for selecting tools for metagenomic analyses by comparing ranges of precision, accuracy, and recall. We show that proper experimental design and analysis parameters can reduce false positives, provide greater resolution of species in complex metagenomic samples, and improve the interpretation of results.
Pages: 19