Comparison and evaluation of statistical error models for scRNA-seq

Cited by: 233

Authors:

Choudhary, Saket ^{[1]}

Satija, Rahul ^{[1,2]}

Affiliations:

[1] New York Genome Ctr, 101 Ave Amer, New York, NY 10013 USA

[2] NYU, Ctr Genom & Syst Biol, 12 Waverly Pl, New York, NY 10003 USA

Source:

GENOME BIOLOGY | 2022 / Vol. 23 / Issue 01

Keywords:

Single-cell RNA-seq; Normalization; Dimension reduction; Variable genes; Differential expression; Feature selection; DIFFERENTIAL EXPRESSION ANALYSIS; CELL RNA-SEQ; GENE-EXPRESSION; SINGLE; NOISE; VISUALIZATION; CHALLENGES;

DOI:

10.1186/s13059-021-02584-9

Chinese Library Classification (CLC) number:

Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];

Discipline classification codes:

071005 ; 0836 ; 090102 ; 100705 ;

Abstract:

Background: Heterogeneity in single-cell RNA-seq (scRNA-seq) data is driven by multiple sources, including biological variation in cellular state and technical variation introduced during experimental processing. Deconvolving these effects is a key challenge for preprocessing workflows. Recent work has demonstrated the importance and utility of count models for scRNA-seq analysis, but there is no consensus on which statistical distributions and parameter settings are appropriate. Results: Here, we analyze 59 scRNA-seq datasets that span a wide range of technologies, systems, and sequencing depths to evaluate the performance of different error models. While a Poisson error model appears appropriate for sparse datasets, we observe clear evidence of overdispersion for genes with sufficient sequencing depth in all biological systems, necessitating the use of a negative binomial model. Moreover, the degree of overdispersion varies widely across datasets, systems, and gene abundances, which argues for a data-driven approach to parameter estimation. Conclusions: Based on these analyses, we provide a set of recommendations for modeling variation in scRNA-seq data, particularly when using generalized linear models or likelihood-based approaches for preprocessing and downstream analysis.
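
The contrast the abstract draws between Poisson and negative binomial error models can be illustrated with a small simulation. This is a minimal sketch and not the authors' method: it assumes the standard negative binomial parameterization Var = mu + mu^2 / theta, and the function name `moment_overdispersion`, the simulated mean `mu`, and the value `theta = 2.0` are all hypothetical choices made for illustration. It uses a simple method-of-moments estimate rather than the GLM-based fitting discussed in the paper.

```python
import numpy as np


def moment_overdispersion(counts):
    """Method-of-moments estimate of the overdispersion alpha = 1 / theta.

    Assumes Var = mu + alpha * mu**2 (negative binomial). Returns 0.0 when
    the sample variance does not exceed the mean, i.e. Poisson-like data.
    """
    counts = np.asarray(counts, dtype=float)
    mu = counts.mean()
    if mu == 0:
        return 0.0
    var = counts.var(ddof=1)
    return max((var - mu) / mu**2, 0.0)


rng = np.random.default_rng(0)
n_cells = 20000
mu = 5.0      # hypothetical mean count for one gene
theta = 2.0   # hypothetical inverse overdispersion

# Poisson counts: variance equals the mean, so alpha should be near 0.
poisson_counts = rng.poisson(mu, size=n_cells)

# Negative binomial counts drawn as a gamma-Poisson mixture:
# rate ~ Gamma(shape=theta, scale=mu / theta), count ~ Poisson(rate).
rates = rng.gamma(shape=theta, scale=mu / theta, size=n_cells)
nb_counts = rng.poisson(rates)

alpha_poisson = moment_overdispersion(poisson_counts)
alpha_nb = moment_overdispersion(nb_counts)

print(f"Poisson data: estimated alpha = {alpha_poisson:.3f}")
print(f"NB data:      estimated alpha = {alpha_nb:.3f} (simulated 1/theta = {1 / theta})")
```

With these settings the estimate for the Poisson sample should sit close to zero and the estimate for the gamma-Poisson sample close to 1/theta. In real data the moment estimator is noisy for lowly expressed genes, which is one reason a likelihood-based fit may be preferable.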


Pages: 20
