How does normalization impact RNA-seq disease diagnosis?

Cited by: 18
Authors
Han, Henry [1]
Men, Ke [2]
Affiliations
[1] Fordham Univ, Lincoln Ctr, Dept Comp & Informat Sci, New York, NY 10023 USA
[2] Xian Med Univ, Dept Publ Hlth, Xian 710021, Shaanxi, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
RNA-seq; Normalization; Big data; Machine learning; Singular value decomposition; Expression; Selection
DOI
10.1016/j.jbi.2018.07.016
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
With the surge of next-generation high-throughput technologies, RNA-seq data is playing an increasingly important role in disease diagnosis, where normalization is assumed to be an essential step for producing comparable samples. Recent studies have proposed different normalization methods to remove various technical biases in RNA sequencing. However, no previous study has evaluated the impact of normalization on RNA-seq disease diagnosis. In this study, we investigate this problem by analyzing structured big data: RNA-seq data acquired from the TCGA portal, chosen for its popularity in RNA-seq disease diagnosis. We propose a novel normalization effect test algorithm, a diagnostic index (d-index), and data entropy to analyze and evaluate the impact of normalization on RNA-seq disease diagnosis using state-of-the-art machine learning models. Furthermore, we present an original visualization analysis to compare the performance of normalized data with that of raw data. We found that normalized data generally yields diagnostic performance equivalent to, or even lower than, that of its raw data. Moreover, some normalization approaches (e.g., RPKM) even have negative effects on disease diagnosis. Raw data, on the other hand, appears able to decipher pathological status better than, or at least comparably to, normalized data. Our visualization analysis also shows that some normalization methods introduce 'outliers', which unavoidably decrease sample detectability in diagnosis. More importantly, our data entropy analysis shows that normalized data usually has entropy values equivalent to or lower than those of raw data, and data with high entropy values tends to achieve better diagnosis than data with low entropy values. In addition, we found that high-dimensional imbalance (HDI) data is unaffected by any normalization procedure in diagnosis and defeats almost all machine learning models, which recognize only the majority types whether the data is raw or normalized.
Our results suggest that normalized data may not demonstrate a statistically significant advantage in disease diagnosis over its raw form. This further implies that normalization, or at least some normalization procedures, may not be indispensable in RNA-seq disease diagnosis. Instead, raw data may perform better because it captures more of the original transcriptome patterns under different pathological conditions.
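For readers unfamiliar with RPKM, the normalization the abstract singles out, the following is a minimal sketch of the standard textbook definition (reads per kilobase of transcript per million mapped reads). It is not the authors' pipeline: the function name `rpkm`, the toy count matrix, and the gene lengths are all hypothetical and chosen only for illustration.

```python
import numpy as np

def rpkm(counts, gene_lengths_bp):
    """Standard RPKM: count * 1e9 / (gene length in bp * total reads in sample).

    counts          : genes x samples matrix of raw read counts
    gene_lengths_bp : per-gene length in base pairs
    """
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(gene_lengths_bp, dtype=float)
    total_reads = counts.sum(axis=0)  # per-sample library size
    # Broadcast lengths down the rows and library sizes across the columns.
    return counts * 1e9 / (lengths[:, None] * total_reads[None, :])

# Hypothetical toy data: 3 genes x 2 samples. Sample 2 was sequenced
# exactly twice as deeply as sample 1.
raw = np.array([[100, 200],
                [300, 600],
                [600, 1200]])
lengths = np.array([1000, 2000, 3000])

normalized = rpkm(raw, lengths)
```

In this toy case the two columns of `normalized` coincide, because RPKM divides out the library-size difference that separates the raw samples. Whether removing that kind of variation helps or hurts a classifier is the question the paper studies.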
Pages: 80-92
Page count: 13