A UNIFIED STATISTICAL FRAMEWORK FOR SINGLE CELL AND BULK RNA SEQUENCING DATA

Cited by: 51
Authors
Zhu, Lingxue [1]
Lei, Jing [1]
Devlin, Bernie [2]
Roeder, Kathryn [1]
Affiliations
[1] Carnegie Mellon Univ, Dept Stat, 5000 Forbes Ave, Pittsburgh, PA 15213 USA
[2] Univ Pittsburgh, Sch Med, Dept Psychiat & Human Genet, 3811 O'Hara St, Pittsburgh, PA 15213 USA
Keywords
Single cell RNA sequencing; hierarchical model; empirical Bayes; Gibbs sampling; EM algorithm; GENE-EXPRESSION; INFERENCE
DOI
10.1214/17-AOAS1110
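Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714
Abstract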
Recent advances in technology have enabled the measurement of RNA levels for individual cells. Compared to traditional tissue-level bulk RNA-seq data, single cell sequencing yields valuable insights about gene expression profiles for different cell types, which is potentially critical for understanding many complex human diseases. However, developing quantitative tools for such data remains challenging because of high levels of technical noise, especially the "dropout" events. A "dropout" happens when the RNA for a gene fails to be amplified prior to sequencing, producing a "false" zero in the observed data. In this paper, we propose a Unified RNA-Sequencing Model (URSM) for both single cell and bulk RNA-seq data, formulated as a hierarchical model. URSM borrows strength from both data sources and carefully models the dropouts in single cell data, leading to more accurate estimation of cell type specific gene expression profiles. In addition, URSM naturally provides inference on the dropout entries in single cell data that need to be imputed for downstream analyses, as well as the mixing proportions of different cell types in bulk samples. We adopt an empirical Bayes approach, where parameters are estimated using the EM algorithm and approximate inference is obtained by Gibbs sampling. Simulation results illustrate that URSM outperforms existing approaches both in correcting for dropouts in single cell data and in deconvolving bulk samples. We also demonstrate an application to gene expression data on fetal brains, where our model successfully imputes the dropout genes and reveals cell type specific expression patterns.
Pages: 609-632
Page count: 24
Related papers
44 in total
[1]   Deconvolution of Blood Microarray Data Identifies Cellular Activation Patterns in Systemic Lupus Erythematosus [J].
Abbas, Alexander R. ;
Wolslegel, Kristen ;
Seshasayee, Dhaya ;
Modrusan, Zora ;
Clark, Hilary F. .
PLOS ONE, 2009, 4 (07)
[2]  
[Anonymous], 2017, BIORXIV
[3]   Variational Inference: A Review for Statisticians [J].
Blei, David M. ;
Kucukelbir, Alp ;
McAuliffe, Jon D. .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2017, 112 (518) :859-877
[4]   Latent Dirichlet allocation [J].
Blei, DM ;
Ng, AY ;
Jordan, MI .
JOURNAL OF MACHINE LEARNING RESEARCH, 2003, 3 (4-5) :993-1022
[5]  
Brennecke P, 2013, NAT METHODS, V10, P1093, DOI [10.1038/nmeth.2645, 10.1038/NMETH.2645]
[6]   Human cerebral organoids recapitulate gene expression programs of fetal neocortex development [J].
Camp, J. Gray ;
Badsha, Farhath ;
Florio, Marta ;
Kanton, Sabina ;
Gerber, Tobias ;
Wilsch-Braeuninger, Michaela ;
Lewitus, Eric ;
Sykes, Alex ;
Hevers, Wulf ;
Lancaster, Madeline ;
Knoblich, Juergen A. ;
Lachmann, Robert ;
Paeaebo, Svante ;
Huttner, Wieland B. ;
Treutlein, Barbara .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2015, 112 (51) :15672-15677
[7]   EXPLAINING THE GIBBS SAMPLER [J].
CASELLA, G ;
GEORGE, EI .
AMERICAN STATISTICIAN, 1992, 46 (03) :167-174
[8]   MAXIMUM LIKELIHOOD FROM INCOMPLETE DATA VIA EM ALGORITHM [J].
DEMPSTER, AP ;
LAIRD, NM ;
RUBIN, DB .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL, 1977, 39 (01) :1-38
[9]  
Donoho D, 2003, ADV NEURAL INFORM PR
[10]  
Dupuy C, 2017, J MACH LEARN RES, V18