Can Survival Prediction Be Improved By Merging Gene Expression Data Sets?

Cited by: 43
Authors
Yasrebi, Haleh
Sperisen, Peter
Praz, Viviane
Bucher, Philipp
Affiliations
[1] Swiss Institute for Experimental Cancer Research (ISREC), Swiss Federal Institute of Technology (EPFL), School of Life Sciences, Lausanne
[2] Swiss Institute of Bioinformatics, EPFL SV ISREC, Lausanne
Source
PLOS ONE | 2009 / Volume 4 / Issue 10
Keywords
BREAST-CANCER; MICROARRAY DATA; ESTROGEN-RECEPTOR; HISTOLOGIC GRADE; MARKER GENES; SIGNATURE; PLATFORM; CLASSIFICATION; CARCINOMAS; SUBTYPES;
DOI
10.1371/journal.pone.0007431
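Chinese Library Classification number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification code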
07 ; 0710 ; 09 ;
Abstract
Background: High-throughput gene expression profiling technologies, which generate a wealth of data, are increasingly used to characterize tumor biopsies in clinical trials. By applying machine learning algorithms to such clinically documented data sets, one hopes to improve tumor diagnosis, prognosis, and prediction of treatment response. However, the limited number of patients enrolled in a single trial limits the power of machine learning approaches because of over-fitting. One could partially overcome this limitation by merging data from different studies. Nevertheless, such data sets differ from each other with regard to technical biases, patient selection criteria and follow-up treatment. It is therefore not clear at all whether the advantage of increased sample size outweighs the disadvantage of higher heterogeneity in merged data sets. Here, we present a systematic study to answer this question specifically for breast cancer data sets. We use survival prediction based on Cox regression as an assay to measure the added value of merged data sets.
Results: Using time-dependent Receiver Operating Characteristic-Area Under the Curve (ROC-AUC) and hazard ratio as performance measures, we see overall no significant improvement or deterioration of survival prediction with merged data sets as compared to individual data sets. This was apparently because a few genes with strong prognostic power were not available on all microarray platforms and thus were not retained in the merged data sets. Surprisingly, we found that the overall best performance was achieved with a single-gene predictor consisting of CYB5D1.
Conclusions: Merging did not deteriorate performance on average despite (a) the diversity of microarray platforms used; (b) the heterogeneity of patient cohorts; (c) the heterogeneity of breast cancer disease; (d) substantial variation of time to death or relapse; and (e) the reduced number of genes in the merged data sets. Predictors derived from the merged data sets were more robust, consistent and reproducible across microarray platforms. Moreover, merging data sets from different studies helps to better understand the biases of individual studies and can lead to the identification of strong survival factors like CYB5D1 expression.
Pages: 14
Related papers
50 in total
  • [31] Consensus clustering of gene expression data and its application to gene function prediction
    Xiao, Guanghua
    Pan, Wei
    JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS, 2007, 16 (03) : 733 - 751
  • [32] Negative correlation based gene markers identification in integrative gene expression data
    Zeng, Tao
    Guo, Xuan
    Liu, Juan
    INTERNATIONAL JOURNAL OF DATA MINING AND BIOINFORMATICS, 2014, 10 (01) : 1 - 17
  • [33] Gene Expression Profiling for Survival Prediction in Pediatric Rhabdomyosarcomas: A Report From the Children's Oncology Group
    Davicioni, Elai
    Anderson, James R.
    Buckley, Jonathan D.
    Meyer, William H.
    Triche, Timothy J.
    JOURNAL OF CLINICAL ONCOLOGY, 2010, 28 (07) : 1240 - 1246
  • [34] Novel gene sets improve set-level classification of prokaryotic gene expression data
    Holec, Matej
    Kuzelka, Ondrej
    Zelezny, Filip
    BMC BIOINFORMATICS, 2015, 16
  • [35] Improving the Prediction of Survival in Cancer Patients by Using Machine Learning Techniques: Experience of Gene Expression Data: A Narrative Review
    Bashiri, Azadeh
    Ghazisaeedi, Marjan
    Safdari, Reza
    Shahmoradi, Leila
    Ehtesham, Hamide
    IRANIAN JOURNAL OF PUBLIC HEALTH, 2017, 46 (02) : 165 - 172
  • [36] Discovering negative correlated gene sets from integrative gene expression data for cancer prognosis
    Zeng, Tao
    Guo, Xuan
    Liu, Juan
    2010 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE, 2010 : 489 - 492
  • [37] Using Supervised Complexity Measures in the Analysis of Cancer Gene Expression Data Sets
    Costa, Ivan G.
    Lorena, Ana C.
    Peres, Liciana R. M. P. y
    de Souto, Marcilio C. P.
    ADVANCES IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, PROCEEDINGS, 2009, 5676 : 48 - +
  • [38] Benchmark of filter methods for feature selection in high-dimensional gene expression survival data
    Bommert, Andrea
    Welchowski, Thomas
    Schmid, Matthias
    Rahnenfuehrer, Joerg
    BRIEFINGS IN BIOINFORMATICS, 2022, 23 (01)
  • [39] Risk classification of cancer survival using ANN with gene expression data from multiple laboratories
    Chen, Yen-Chen
    Ke, Wan-Chi
    Chiu, Hung-Wen
    COMPUTERS IN BIOLOGY AND MEDICINE, 2014, 48 : 1 - 7
  • [40] Effects of Sample Size on Differential Gene Expression, Rank Order and Prediction Accuracy of a Gene Signature
    Stretch, Cynthia
    Khan, Sheehan
    Asgarian, Nasimeh
    Eisner, Roman
    Vaisipour, Saman
    Damaraju, Sambasivarao
    Graham, Kathryn
    Bathe, Oliver F.
    Steed, Helen
    Greiner, Russell
    Baracos, Vickie E.
    PLOS ONE, 2013, 8 (06)