A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies

Cited by: 1
Authors
Van, Richard [1, 3]
Alvarez, Daniel [2, 3]
Mize, Travis [4]
Gannavarapu, Sravani [2, 3]
Chintham Reddy, Lohitha [2, 3]
Nasoz, Fatma [2, 3]
Han, Mira V. [1, 3]
Affiliations
[1] Univ Nevada, Sch Life Sci, Las Vegas, NV 89154 USA
[2] Univ Nevada, Dept Comp Sci, Las Vegas, NV USA
[3] Nevada Inst Personalized Med, Las Vegas, NV 89154 USA
[4] Icahn Sch Med Mt Sinai, Inst Genom Hlth, New York, NY USA
Funding
National Institutes of Health (USA);
Keywords
RNA-Seq; Classification; Cancer; Batch effect correction; Normalization; Data scaling; GENE-EXPRESSION; CANCER; TISSUE; DISCOVERY; REMOVAL;
DOI
10.1186/s12859-024-05801-x
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Background: RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often perform suboptimally when integrating molecular datasets generated by different labs. The data often vary in quality, are procured differently, and contain unwanted noise that hampers the ability of a predictive model to extract useful information. Data preprocessing methods can be applied to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origin.

Results: We aimed to investigate the impact of data preprocessing steps, focusing on normalization, batch effect correction, and data scaling, through trial and comparison. Our goal was to improve the cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the associated classifier models constructed for tissue of origin predictions in cancer.

Conclusion: By using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance, measured by weighted F1-score, in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.
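
The abstract reports performance as a weighted F1-score but does not define it. As a minimal sketch, assuming the common support-weighted definition (per-class F1 averaged with weights proportional to the number of true samples in each class), the metric can be computed as below. The function name `weighted_f1` and the toy tissue labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (assumed definition)."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: samples of this class that were predicted as this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Weight each class by its share of the true labels.
        score += (n_true / total) * f1
    return score


# Hypothetical toy labels, for illustration only.
y_true = ["lung", "lung", "lung", "breast", "breast", "colon"]
y_pred = ["lung", "lung", "breast", "breast", "breast", "lung"]
score = weighted_f1(y_true, y_pred)
```

Because each class is weighted by its support, tumor types with many test samples dominate the score, which matters when the test set is unbalanced across tissues.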
Pages: 22