Large-scale benchmarking reveals false discoveries and count transformation sensitivity in 16S rRNA gene amplicon data analysis methods used in microbiome studies

Cited by: 111
Authors
Thorsen, Jonathan [1]
Brejnrod, Asker [2,3]
Mortensen, Martin [2]
Rasmussen, Morten A. [1]
Stokholm, Jakob [1]
Abu Al-Soud, Waleed [2]
Sorensen, Soren [2]
Bisgaard, Hans [1]
Waage, Johannes [1]
Affiliations
[1] Univ Copenhagen, Herlev & Gentofte Hosp, COPSAC, Copenhagen Prospect Studies Asthma Childhood, Copenhagen, Denmark
[2] Univ Copenhagen, Microbiol Sect, Dept Biol, Copenhagen, Denmark
[3] Univ Copenhagen, Dept Biol, Lab Genom & Mol Biomed, Copenhagen, Denmark
Source
MICROBIOME | 2016 / Vol. 4
Keywords
16S sequencing; Microbiome; Benchmark; Differential relative abundance; Beta-diversity; DIFFERENTIAL EXPRESSION; PACKAGE
DOI
10.1186/s40168-016-0208-8
Chinese Library Classification (CLC) number
Q93 [Microbiology];
Subject classification code
071005 ; 100705 ;
Abstract
Background: There is an immense scientific interest in the human microbiome and its effects on human physiology, health, and disease. A common approach for examining bacterial communities is high-throughput sequencing of 16S rRNA gene hypervariable regions, aggregating sequence-similar amplicons into operational taxonomic units (OTUs). Strategies for detecting differential relative abundance of OTUs between sample conditions include classical statistical approaches as well as a plethora of newer methods, many borrowing from the related field of RNA-seq analysis. This effort is complicated by unique data characteristics, including sparsity, sequencing depth variation, and nonconformity of read counts to theoretical distributions, which is often exacerbated by exploratory and/or unbalanced study designs. Here, we assess the robustness of available methods for (1) inference in differential relative abundance analysis and (2) beta-diversity-based sample separation, using a rigorous benchmarking framework based on large clinical 16S microbiome datasets from different sources.
Results: Running more than 380,000 full differential relative abundance tests on real datasets with permuted case/control assignments and in silico-spiked OTUs, we identify large differences in method performance on a range of parameters, including false positive rates, sensitivity to sparsity and case/control balances, and spike-in retrieval rate. In large datasets, methods with the highest false positive rates also tend to have the best detection power. For beta-diversity-based sample separation, we show that library size normalization has very little effect and that the distance metric is the most important factor in terms of separation power.
Conclusions: Our results, generalizable to datasets from different sequencing platforms, demonstrate how the choice of method considerably affects analysis outcome. Here, we give recommendations for tools that exhibit low false positive rates, have good retrieval power across effect sizes and case/control proportions, and have low sparsity bias. Result output from some commonly used methods should be interpreted with caution. We provide an easily extensible framework for benchmarking of new methods and future microbiome datasets.
Pages: 14