Influence of single-cell RNA sequencing data integration on the performance of differential gene expression analysis

Cited by: 2
Authors
Kujawa, Tomasz [1]
Marczyk, Michal [1, 2]
Polanska, Joanna [1]
Affiliations
[1] Silesian Tech Univ, Dept Data Sci & Engn, Gliwice, Poland
[2] Yale Sch Med, Yale Canc Ctr, New Haven, CT USA
Keywords
single-cell RNA sequencing; data integration; batch correction; differential gene expression; joint analysis
DOI
10.3389/fgene.2022.1009316
Chinese Library Classification number
Q3 [Genetics]
Discipline classification code
071007; 090102
Abstract
Large-scale, comprehensive single-cell experiments are often resource-intensive and require the involvement of many laboratories and/or measurements taken at various times. This inevitably leads to batch effects, that is, systematic variations in the data that arise from different technology platforms, reagent lots, or handling personnel. Such technical differences confound the biological variation of interest and need to be corrected during data integration. Data integration is challenging because biological and technical factors overlap, which makes it difficult to distinguish their individual contributions to the overall observed effect. Moreover, the choice of integration method may affect downstream analyses, including the search for differentially expressed genes. Among the existing data integration methods, we selected only those that return the full expression matrix. We evaluated six methods in terms of their influence on the performance of differential gene expression analysis in two single-cell datasets that share the same biological study design and differ only in how the measurements were made: one dataset shows strong batch effects because each sample was measured at a different time. Integrated data were visualized using the UMAP method. The evaluation was done both at the individual gene level, using parametric and non-parametric approaches for finding differentially expressed genes, and at the gene set level, using gene set enrichment analysis. As evaluation metrics, we used two correlation coefficients, Pearson and Spearman, computed on the test statistics obtained for the reference, test, and corrected studies. Visual comparison of UMAP plots highlighted ComBat-seq, limma, and MNN, which reduced batch effects and preserved differences between biological conditions. Most of the tested methods changed the data distribution after integration, which negatively affects the use of parametric methods in the analysis. Two algorithms, MNN and Scanorama, gave very poor results in differential analysis at both the gene and gene set levels. Finally, we highlight ComBat-seq because it led to the highest correlation of test statistics between the reference and corrected datasets among the tested methods. Moreover, it does not distort the original distribution of gene expression data, so it can be used in all types of downstream analyses.
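The evaluation metric described in the abstract (Pearson and Spearman correlation of per-gene test statistics between a reference study and a batch-corrected study) can be illustrated with a minimal sketch. This is not the authors' code: the function names and the five example statistics are hypothetical, and the abstract does not say which software was used to compute the correlations.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-gene test statistics (for example, t statistics),
# listed in the same gene order for both studies. Values are made up.
reference_stats = [2.1, -0.5, 3.3, 0.0, -1.8]
corrected_stats = [1.9, -0.2, 6.0, 0.1, -2.0]

r_pearson = pearson(reference_stats, corrected_stats)
r_spearman = spearman(reference_stats, corrected_stats)
print(f"Pearson: {r_pearson:.3f}, Spearman: {r_spearman:.3f}")
```

Under this reading, a correction method that preserves the biological signal should give correlations close to 1 with the reference study. Spearman depends only on the ordering of genes, whereas Pearson is also sensitive to changes in the scale or shape of the statistic's distribution.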
Pages: 13