Influence of single-cell RNA sequencing data integration on the performance of differential gene expression analysis

Cited by: 2
Authors
Kujawa, Tomasz [1]
Marczyk, Michal [1, 2]
Polanska, Joanna [1]
Affiliations
[1] Silesian Tech Univ, Dept Data Sci & Engn, Gliwice, Poland
[2] Yale Sch Med, Yale Canc Ctr, New Haven, CT USA
Keywords
single-cell RNA sequencing; data integration; batch correction; differential gene expression; joint analysis
DOI
10.3389/fgene.2022.1009316
Chinese Library Classification number
Q3 [Genetics]
Subject classification code
071007; 090102
Abstract
Large-scale comprehensive single-cell experiments are often resource-intensive and require the involvement of many laboratories and/or measurements taken at different times. This inevitably leads to batch effects: systematic variations in the data that may arise from different technology platforms, reagent lots, or handling personnel. Such technical differences confound the biological variation of interest and need to be corrected during data integration. Data integration is challenging because biological and technical factors overlap, which makes it difficult to separate their individual contributions to the overall observed effect. Moreover, the choice of integration method may affect downstream analyses, including the search for differentially expressed genes. From the existing data integration methods, we selected only those that return the full expression matrix. We evaluated six methods in terms of their influence on the performance of differential gene expression analysis in two single-cell datasets with the same biological study design that differ only in how the measurements were made: one dataset shows strong batch effects because each sample was measured at a different time. Integrated data were visualized using the UMAP method. The evaluation was done both at the individual gene level, using parametric and non-parametric approaches for finding differentially expressed genes, and at the gene set level, using gene set enrichment analysis. As evaluation metrics, we used two correlation coefficients, Pearson and Spearman, computed on the test statistics obtained for the reference, test, and corrected studies. Visual comparison of UMAP plots highlighted ComBat-seq, limma, and MNN, which reduced batch effects and preserved differences between biological conditions. Most of the tested methods changed the data distribution after integration, which negatively affects the use of parametric methods for the analysis. Two algorithms, MNN and Scanorama, gave very poor results in differential analysis at both the gene and gene set levels. Finally, we highlight ComBat-seq, as it led to the highest correlation of test statistics between the reference and corrected datasets among the methods tested. Moreover, it does not distort the original distribution of gene expression data, so it can be used in all types of downstream analyses.
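The evaluation metric described in the abstract, the correlation of per-gene test statistics between a reference dataset and a batch-corrected one, can be illustrated with a minimal sketch. This is not the authors' code: the function names (`pearson`, `average_ranks`, `spearman`, `agreement`), the gene labels, and the numeric values are all hypothetical, and the statistics are assumed to be already computed and stored as gene-to-value dictionaries.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def agreement(reference_stats, other_stats):
    """Correlate test statistics over the genes present in both results."""
    genes = sorted(set(reference_stats) & set(other_stats))
    ref = [reference_stats[g] for g in genes]
    oth = [other_stats[g] for g in genes]
    return {
        "n_genes": len(genes),
        "pearson": pearson(ref, oth),
        "spearman": spearman(ref, oth),
    }


# Made-up per-gene test statistics, for illustration only.
reference = {"GENE_A": 4.0, "GENE_B": -2.0, "GENE_C": 0.5, "GENE_D": 1.5}
corrected = {"GENE_A": 8.0, "GENE_B": -4.0, "GENE_C": 1.0, "GENE_D": 3.0}
result = agreement(reference, corrected)
print(result)
```

In this toy input the corrected statistics are an exact rescaling of the reference ones, so both coefficients are close to 1. With real data, a method that preserves the ordering of genes but compresses or stretches the statistics would be expected to score higher on Spearman than on Pearson, which is one reason to report both.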
Pages: 13