A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization

Cited by: 15
Authors
Hornung, Roman [1]
Bernau, Christoph [1, 2]
Truntzer, Caroline [3]
Wilson, Rory [1]
Stadler, Thomas [4]
Boulesteix, Anne-Laure [1]
Affiliations
[1] Univ Munich, Dept Med Informat Biometry & Epidemiol, D-81377 Munich, Germany
[2] Leibniz Supercomp Ctr, D-85748 Garching, Germany
[3] Pole Rech Univ Bourgogne, Clin & Innovat Prote Platform, F-21000 Dijon, France
[4] Univ Munich, Dept Urol, D-81377 Munich, Germany
Source
BMC MEDICAL RESEARCH METHODOLOGY | 2015 / Vol. 15
Keywords
Cross-validation; Error estimation; Over-optimism; Practical guidelines; Supervised learning; REAL DATA; VALIDATION; BIAS; SELECTION; CLASSIFICATION;
DOI
10.1186/s12874-015-0088-9
Chinese Library Classification number
R19 [Health care organizations and services (health services administration)];
Abstract
Background: In applications of supervised statistical learning in the biomedical field it is necessary to assess the prediction error of the respective prediction rules. Often, data preparation steps are performed on the dataset in its entirety before training/test set based prediction error estimation by cross-validation (CV), an approach referred to as "incomplete CV". Whether incomplete CV can result in an optimistically biased error estimate depends on the data preparation step under consideration. Several empirical studies have investigated the extent of bias induced by performing preliminary supervised variable selection before CV. To our knowledge, however, the potential bias induced by other data preparation steps has not yet been examined in the literature. In this paper we investigate this bias for two common data preparation steps: normalization and principal component analysis for dimension reduction of the covariate space (PCA). Furthermore, we obtain preliminary results for the following steps: optimization of tuning parameters, variable filtering by variance and imputation of missing values.
Methods: We devise the easily interpretable and general measure CVIIM ("CV Incompleteness Impact Measure") to quantify the extent of bias induced by incomplete CV with respect to a data preparation step of interest. This measure can be used to determine whether a specific data preparation step should, as a general rule, be performed in each CV iteration or whether an incomplete CV procedure would be acceptable in practice. We apply CVIIM to large collections of microarray datasets to answer this question for normalization and PCA.
Results: Performing normalization on the entire dataset before CV did not result in a noteworthy optimistic bias in any of the investigated cases. In contrast, when performing PCA before CV, medium to strong underestimates of the prediction error were observed in multiple settings.
Conclusions: While the investigated forms of normalization can be safely performed before CV, PCA has to be performed anew in each CV split to protect against optimistic bias.
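The distinction between incomplete and full CV described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the data preparation step is a simple column standardization (standing in for normalization or PCA), the classifier is a nearest-centroid rule, and all function names are hypothetical. The function `relative_bias` is a ratio-based measure in the spirit of CVIIM; the abstract does not state the formula, so the form used here is an assumption as well.

```python
import random
import statistics


def fit_scaler(rows):
    """Learn per-column mean and standard deviation (the data preparation step)."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant columns
    return means, sds


def apply_scaler(rows, scaler):
    """Standardize rows with previously learned means and standard deviations."""
    means, sds = scaler
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]


def fit_centroids(rows, labels):
    """Nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for lab in set(labels):
        members = [r for r, l in zip(rows, labels) if l == lab]
        centroids[lab] = [statistics.fmean(c) for c in zip(*members)]
    return centroids


def predict(centroids, row):
    """Return the label of the centroid closest to row (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, centroids[lab])),
    )


def cv_error(rows, labels, k=5, incomplete=False, seed=0):
    """K-fold CV misclassification rate.

    incomplete=True : the preparation step is fitted once on ALL observations.
    incomplete=False: the preparation step is refitted on each training set.
    """
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    global_scaler = fit_scaler(rows) if incomplete else None
    errors = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        train_rows = [rows[i] for i in train_idx]
        scaler = global_scaler if incomplete else fit_scaler(train_rows)
        model = fit_centroids(
            apply_scaler(train_rows, scaler), [labels[i] for i in train_idx]
        )
        for i in test_idx:
            pred = predict(model, apply_scaler([rows[i]], scaler)[0])
            errors += pred != labels[i]
    return errors / len(rows)


def relative_bias(e_incomplete, e_full):
    """Assumed ratio form: relative reduction of the error estimate, floored at 0."""
    if e_full > 0 and e_incomplete < e_full:
        return 1.0 - e_incomplete / e_full
    return 0.0
```

Calling `cv_error(rows, labels, incomplete=True)` and `cv_error(rows, labels, incomplete=False)` on the same data and passing both results to `relative_bias` gives a value between 0 and 1, where 0 means the incomplete procedure showed no optimistic bias on that dataset. In practice the two error estimates would be averaged over repeated CV runs before forming the ratio, since a single run is noisy.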
Pages: 15