The impact of different sources of heterogeneity on loss of accuracy from genomic prediction models

Times cited: 12
Authors
Zhang, Yuqing [1]
Bernau, Christoph [2]
Parmigiani, Giovanni [3, 4]
Waldron, Levi [5]
Affiliations
[1] Boston Univ, Grad Program Bioinformat, 24 Cummington Mall, Boston, MA 02215 USA
[2] Univ Munich, Dept Med Informat Biometry & Epidemiol, Marchioninistr 15, D-81377 Munich, Germany
[3] Dana Farber Canc Inst, Dept Biostat & Computat Biol, 3 Blackfan Cir, Boston, MA 02115 USA
[4] Harvard TH Chan Sch Publ Hlth, Dept Biostat, 677 Huntington Ave, Boston, MA 02115 USA
[5] CUNY, Inst Implementat Sci Populat Hlth, Grad Sch Publ Hlth & Hlth Policy, 55 W 125th St, New York, NY 10027 USA
Funding
National Institutes of Health (USA)
Keywords
Cross-study validation; Data heterogeneity; Genomic prediction models; Gene expression
DOI
10.1093/biostatistics/kxy044
Chinese Library Classification
Q [Biological Sciences]
Discipline classification codes
07; 0710; 09
Abstract
Cross-study validation (CSV) of prediction models is an alternative to traditional cross-validation (CV) in domains where multiple comparable datasets are available. Although many studies have noted potential sources of heterogeneity in genomic studies, to our knowledge none have systematically investigated their intertwined impacts on prediction accuracy across studies. We employ a hybrid parametric/non-parametric bootstrap method to realistically simulate publicly available compendia of microarray, RNA-seq, and whole metagenome shotgun microbiome studies of health outcomes. Three types of heterogeneity between studies are manipulated and studied: (i) imbalances in the prevalence of clinical and pathological covariates, (ii) differences in gene covariance that could be caused by batch, platform, or tumor purity effects, and (iii) differences in the "true" model that associates gene expression and clinical factors to outcome. We assess model accuracy, while altering these factors. Lower accuracy is seen in CSV than in CV. Surprisingly, heterogeneity in known clinical covariates and differences in gene covariance structure have very limited contributions in the loss of accuracy when validating in new studies. However, forcing identical generative models greatly reduces the within/across study difference. These results, observed consistently for multiple disease outcomes and omics platforms, suggest that the most easily identifiable sources of study heterogeneity are not necessarily the primary ones that undermine the ability to accurately replicate the accuracy of omics prediction models in new studies. Unidentified heterogeneity, such as could arise from unmeasured confounding, may be more important.
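The contrast between CV and CSV described in the abstract can be illustrated with a toy sketch. This is not the authors' hybrid bootstrap simulation: the data generator, the one-dimensional centroid classifier, and every name below (`simulate_study`, `fit_threshold`, `csv_matrix`, the `shift` parameter) are assumptions made purely for illustration. The `shift` parameter mimics heterogeneity of type (iii), a different "true" model in each study.

```python
import random
from statistics import mean

def simulate_study(n, shift, seed):
    """Toy study (hypothetical): one feature, binary outcome.
    `shift` moves the true decision boundary, so studies differ in their true model."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if x + rng.gauss(0.0, 0.5) > shift else 0 for x in xs]
    return xs, ys

def fit_threshold(xs, ys):
    """Midpoint between the two class means (a nearest-centroid rule in 1-D)."""
    m1 = mean(x for x, y in zip(xs, ys) if y == 1)
    m0 = mean(x for x, y in zip(xs, ys) if y == 0)
    return (m0 + m1) / 2.0

def accuracy(threshold, xs, ys):
    """Fraction of samples whose predicted class matches the observed class."""
    return mean(1.0 if (x > threshold) == (y == 1) else 0.0
                for x, y in zip(xs, ys))

def within_study_cv(xs, ys, k=5):
    """Plain k-fold cross-validation inside a single study."""
    scores = []
    for f in range(k):
        held = set(range(f, len(xs), k))
        train = [i for i in range(len(xs)) if i not in held]
        t = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        scores.append(accuracy(t, [xs[i] for i in held], [ys[i] for i in held]))
    return mean(scores)

def csv_matrix(studies):
    """Z[i][j]: train on study i, validate on study j.
    The diagonal holds within-study CV accuracy instead."""
    z = []
    for i, (xi, yi) in enumerate(studies):
        t = fit_threshold(xi, yi)
        row = []
        for j, (xj, yj) in enumerate(studies):
            row.append(within_study_cv(xi, yi) if i == j else accuracy(t, xj, yj))
        z.append(row)
    return z

# Three studies whose true decision boundaries differ.
studies = [simulate_study(400, shift, seed)
           for seed, shift in enumerate([-1.0, 0.0, 1.0])]
Z = csv_matrix(studies)
cv_acc = mean(Z[i][i] for i in range(3))
csv_acc = mean(Z[i][j] for i in range(3) for j in range(3) if i != j)
print(f"within-study CV accuracy: {cv_acc:.3f}")
print(f"cross-study accuracy:     {csv_acc:.3f}")
```

In this toy setting the off-diagonal (cross-study) average falls below the diagonal (within-study) average because each study was generated with a different decision boundary. Setting all three `shift` values equal would correspond to forcing identical generative models.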
Pages: 253-268
Page count: 16