A joint deep learning model enables simultaneous batch effect correction, denoising, and clustering in single-cell transcriptomics

Cited by: 46
Authors
Lakkis, Justin [1]
Wang, David [2]
Zhang, Yuanchao [1]
Hu, Gang [3]
Wang, Kui [4, 5]
Pan, Huize [6]
Ungar, Lyle [7]
Reilly, Muredach P. [6]
Li, Xiangjie [3]
Li, Mingyao [1]
Affiliations
[1] Univ Penn, Perelman Sch Med, Dept Biostat Epidemiol & Informat, Philadelphia, PA 19104 USA
[2] Univ Penn, Perelman Sch Med, Grad Grp Genom & Computat Biol, Philadelphia, PA 19104 USA
[3] Nankai Univ, Sch Stat & Data Sci, Key Lab Med Data Anal & Stat Res Tianjin, Tianjin 300071, Peoples R China
[4] Nankai Univ, Sch Math Sci, Dept Informat Theory & Data Sci, Tianjin 300071, Peoples R China
[5] Nankai Univ, LPMC, Tianjin 300071, Peoples R China
[6] Columbia Univ, Irving Med Ctr, Dept Med, Div Cardiol, New York, NY 10032 USA
[7] Univ Penn, Sch Engn & Appl Sci, Dept Comp & Informat Sci, Philadelphia, PA 19104 USA
DOI
10.1101/gr.271874.120
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Recent developments in single-cell RNA-seq (scRNA-seq) technologies have led to enormous biological discoveries. As the scale of scRNA-seq studies increases, a major challenge in analysis is batch effects, which are inevitable in studies involving human tissues. Most existing methods remove batch effects in a low-dimensional embedding space. Although this is useful for clustering, batch effects remain in the gene expression space, leaving downstream gene-level analysis susceptible to them. Recent studies have shown that batch effect correction in the gene expression space is much harder than in the embedding space. Methods such as Seurat 3.0 rely on the mutual nearest neighbor (MNN) approach to remove batch effects in gene expression, but MNN can only analyze two batches at a time, and it becomes computationally infeasible when the number of batches is large. Here, we present CarDEC, a joint deep learning model that simultaneously clusters and denoises scRNA-seq data while correcting batch effects both in the embedding and the gene expression space. Comprehensive evaluations spanning different species and tissues showed that CarDEC outperforms Scanorama, DCA + Combat, scVI, and MNN. With CarDEC denoising, non-highly variable genes offer as much signal for clustering as the highly variable genes (HVGs), suggesting that CarDEC substantially boosted information content in scRNA-seq. We also showed that trajectory analysis using CarDEC's denoised and batch-corrected expression as input revealed marker genes and transcription factors that are otherwise obscured in the presence of batch effects. CarDEC is computationally fast, making it a desirable tool for large-scale scRNA-seq studies.
Pages: 1753-1766
Page count: 14