A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies

Cited by: 1
Authors
Van, Richard [1, 3]
Alvarez, Daniel [2, 3]
Mize, Travis [4]
Gannavarapu, Sravani [2, 3]
Chintham Reddy, Lohitha [2, 3]
Nasoz, Fatma [2, 3]
Han, Mira V. [1, 3]
Affiliations
[1] Univ Nevada, Sch Life Sci, Las Vegas, NV 89154 USA
[2] Univ Nevada, Dept Comp Sci, Las Vegas, NV USA
[3] Nevada Inst Personalized Med, Las Vegas, NV 89154 USA
[4] Icahn Sch Med Mt Sinai, Inst Genom Hlth, New York, NY USA
Funding
National Institutes of Health (USA);
Keywords
RNA-Seq; Classification; Cancer; Batch effect correction; Normalization; Data scaling; GENE-EXPRESSION; CANCER; TISSUE; DISCOVERY; REMOVAL;
DOI
10.1186/s12859-024-05801-x
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Background: RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often have suboptimal performance estimates when integrating molecular datasets generated from different labs. Often, the data vary in quality, are procured differently, and contain unwanted noise that hampers the ability of a predictive model to extract useful information. Data preprocessing methods can be applied in an attempt to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origin.
Results: We aimed to investigate the impact of data preprocessing steps (focusing on normalization, batch effect correction, and data scaling) through trial and comparison. Our goal was to improve the cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the associated classifier models constructed for tissue of origin predictions in cancer.
Conclusion: By using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance, measured by weighted F1-score, in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.
Pages: 22