Comparison and evaluation of statistical error models for scRNA-seq

Cited by: 233
Authors
Choudhary, Saket [1]
Satija, Rahul [1,2]
Affiliations
[1] New York Genome Ctr, 101 Ave Amer, New York, NY 10013 USA
[2] NYU, Ctr Genom & Syst Biol, 12 Waverly Pl, New York, NY 10003 USA
Keywords
Single-cell RNA-seq; Normalization; Dimension reduction; Variable genes; Differential expression; Feature selection; DIFFERENTIAL EXPRESSION ANALYSIS; CELL RNA-SEQ; GENE-EXPRESSION; SINGLE; NOISE; VISUALIZATION; CHALLENGES
DOI
10.1186/s13059-021-02584-9
Chinese Library Classification (CLC) number
Q81 [Bioengineering (biotechnology)]; Q93 [Microbiology]
Discipline classification codes
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Background: Heterogeneity in single-cell RNA-seq (scRNA-seq) data is driven by multiple sources, including biological variation in cellular state as well as technical variation introduced during experimental processing. Deconvolving these effects is a key challenge for preprocessing workflows. Recent work has demonstrated the importance and utility of count models for scRNA-seq analysis, but there is a lack of consensus on which statistical distributions and parameter settings are appropriate.
Results: Here, we analyze 59 scRNA-seq datasets that span a wide range of technologies, systems, and sequencing depths in order to evaluate the performance of different error models. While a Poisson error model appears appropriate for sparse datasets, we observe clear evidence of overdispersion for genes with sufficient sequencing depth in all biological systems, necessitating the use of a negative binomial model. Moreover, we find that the degree of overdispersion varies widely across datasets, systems, and gene abundances, which argues for a data-driven approach to parameter estimation.
Conclusions: Based on these analyses, we provide a set of recommendations for modeling variation in scRNA-seq data, particularly when using generalized linear models or likelihood-based approaches for preprocessing and downstream analysis.
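The contrast the abstract draws between Poisson and negative binomial error models can be illustrated with a small simulation. The sketch below is not taken from the paper: the function name `overdispersion_moments`, the parameterization Var = mu + phi * mu^2, the simulated values, and the method-of-moments estimator are all assumptions chosen for illustration, and the paper's own estimation procedure may differ.

```python
import numpy as np

def overdispersion_moments(counts):
    """Method-of-moments estimate of phi in Var = mu + phi * mu**2.

    phi close to 0 is consistent with a Poisson model; phi > 0
    indicates overdispersion (negative binomial-like counts).
    Hypothetical helper, not the paper's estimator.
    """
    counts = np.asarray(counts, dtype=float)
    mu = counts.mean()
    var = counts.var(ddof=1)
    if mu == 0:
        return 0.0
    # Clip at zero: sampling noise can push the estimate slightly negative.
    return max((var - mu) / mu**2, 0.0)

rng = np.random.default_rng(0)
n_cells = 5000      # assumed number of cells
mu = 5.0            # assumed mean count for one gene
phi_true = 0.5      # assumed overdispersion for the simulated NB gene

# Poisson gene: variance equals the mean.
poisson_counts = rng.poisson(mu, size=n_cells)

# Negative binomial gene, simulated as a Gamma-Poisson mixture:
# rate ~ Gamma(shape=1/phi, scale=mu*phi) has mean mu and variance phi*mu^2.
rates = rng.gamma(shape=1.0 / phi_true, scale=mu * phi_true, size=n_cells)
nb_counts = rng.poisson(rates)

phi_pois = overdispersion_moments(poisson_counts)
phi_nb = overdispersion_moments(nb_counts)
print(f"estimated phi (Poisson gene): {phi_pois:.3f}")
print(f"estimated phi (NB gene):      {phi_nb:.3f}")
```

With these simulated settings, the estimate for the Poisson gene should sit near zero and the estimate for the negative binomial gene near the assumed `phi_true`. On real data, such per-gene estimates are noisy at low sequencing depth, which is consistent with the abstract's observation that overdispersion only becomes evident for genes with sufficient depth.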
Pages: 20
Related papers
77 entries in total
[1]  
Ahlmann-Eltze C., 2021, BIOINFORMATICS, DOI 10.1101/2021.06.24.449781
[2]   glmGamPoi: fitting Gamma-Poisson generalized linear models on single cell count data [J].
Ahlmann-Eltze, Constantin ;
Huber, Wolfgang .
BIOINFORMATICS, 2020, 36 (24) :5701-5702
[3]  
Amrhein L., 2019, BIORXIV, 657619
[4]  
Anders S., 2010, GENOME BIOL, V11, pR106, DOI 10.1186/gb-2010-11-10-r106
[5]   Detecting differential usage of exons from RNA-seq data [J].
Anders, Simon ;
Reyes, Alejandro ;
Huber, Wolfgang .
GENOME RESEARCH, 2012, 22 (10) :2008-2017
[6]   M3Drop: dropout-based feature selection for scRNASeq [J].
Andrews, Tallulah S. ;
Hemberg, Martin .
BIOINFORMATICS, 2019, 35 (16) :2865-2867
[7]  
[Anonymous], 1996, J COMPUT GRAPH STAT, DOI 10.2307/1390802
[8]   Broad distribution spectrum from Gaussian to power law appears in stochastic variations in RNA-seq data [J].
Awazu, Akinori ;
Tanabe, Takahiro ;
Kamitani, Mari ;
Tezuka, Ayumi ;
Nagano, Atsushi J. .
SCIENTIFIC REPORTS, 2018, 8
[9]   MetaCell: analysis of single-cell RNA-seq data using K-nn graph partitions [J].
Baran, Yael ;
Bercovich, Akhiad ;
Sebe-Pedros, Arnau ;
Lubling, Yaniv ;
Giladi, Amir ;
Chomsky, Elad ;
Meir, Zohar ;
Hoichman, Michael ;
Lifshitz, Aviezer ;
Tanay, Amos .
GENOME BIOLOGY, 2019, 20 (01)
[10]   Single-cell CUT&Tag profiles histone modifications and transcription factors in complex tissues [J].
Bartosovic, Marek ;
Kabbe, Mukund ;
Castelo-Branco, Goncalo .
NATURE BIOTECHNOLOGY, 2021, 39 (07) :825-835