Statistical evaluation of methods for identification of differentially abundant genes in comparative metagenomics

Cited by: 91
Authors
Jonsson, Viktor [1]
Osterlund, Tobias
Nerman, Olle
Kristiansson, Erik [1]
Affiliations
[1] Chalmers Univ Technol, Dept Math Sci, SE-41296 Gothenburg, Sweden
Source
BMC GENOMICS | 2016 / Vol. 17
Funding
Swedish Research Council;
Keywords
Environmental sequencing; Next generation sequencing; Categorical data analysis; Differential abundance; Receiver operating characteristic; False discovery rate; FALSE DISCOVERY RATE; EXPRESSION ANALYSIS; DATA-MANAGEMENT; PACKAGE; IMG/M;
DOI
10.1186/s12864-016-2386-y
Chinese Library Classification
Q81 [Bioengineering (biotechnology)]; Q93 [Microbiology];
Discipline classification codes
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Background: Metagenomics is the study of microbial communities by sequencing of genetic material directly from environmental or clinical samples. The genes present in the metagenomes are quantified by annotating and counting the generated DNA fragments. Identification of differentially abundant genes between metagenomes can provide important information about differences in community structure, diversity and biological function. Metagenomic data are, however, high-dimensional, contain high levels of biological and technical noise, and typically have few biological replicates. The statistical analysis is therefore challenging, and many approaches have been suggested to date. Results: In this article we perform a comprehensive evaluation of 14 methods for identification of differentially abundant genes between metagenomes. The methods are compared based on their power to detect differentially abundant genes and their ability to correctly estimate the type I error rate and the false discovery rate. We show that sample size, effect size, and gene abundance greatly affect the performance of all methods. Several of the methods also show non-optimal model assumptions and biased false discovery rate estimates, which can result in an excessive number of false positives. We also demonstrate that the performance of several of the methods differs substantially between metagenomic data sequenced by different technologies. Conclusions: Two methods primarily designed for the analysis of RNA sequencing data (edgeR and DESeq2), together with a generalized linear model based on an overdispersed Poisson distribution, were found to have the best overall performance. The results presented in this study may serve as a guide for selecting suitable statistical methods for identification of differentially abundant genes in metagenomes.
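The abstract does not say how the false discovery rate is estimated for each method. As an illustration only, the sketch below shows the Benjamini-Hochberg adjustment, a common way to turn per-gene p-values into FDR-adjusted values. The function name `benjamini_hochberg` and the example p-values are assumptions made here and are not taken from the article.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Illustrative sketch: the name and interface are hypothetical.
    """
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest, so that the
    # adjusted values are monotone in the rank and never exceed 1.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Made-up p-values for four hypothetical genes.
    example = [0.01, 0.04, 0.03, 0.005]
    print(benjamini_hochberg(example))
```

A gene would then be called differentially abundant when its adjusted value falls below the chosen FDR threshold. If the underlying p-values are miscalibrated, for example because the model assumptions do not hold, the adjusted values inherit that bias.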
Pages: 14