Prediction error estimation: a comparison of resampling methods

Cited by: 937
Authors
Molinaro, AM [1]
Simon, R
Pfeiffer, RM
Affiliations
[1] NCI, Biostat Branch, Div Canc Epidemiol & Genet, NIH, Rockville, MD 20852 USA
[2] NCI, Biometr Res Branch, Div Canc Treatment & Diagnost, NIH, Rockville, MD 20852 USA
[3] Yale Univ, Sch Med, Dept Epidemiol & Publ Hlth, New Haven, CT 06520 USA
Funding
National Institutes of Health (USA)
DOI
10.1093/bioinformatics/bti499
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: In genomic studies, thousands of features are collected on relatively few samples. One of the goals of these studies is to build classifiers to predict the outcome of future observations. There are three inherent steps to this process: feature selection, model selection and prediction assessment. With a focus on prediction assessment, we compare several methods for estimating the 'true' prediction error of a prediction model in the presence of feature selection.
Results: For small studies where features are selected from thousands of candidates, the resubstitution and simple split-sample estimates are seriously biased. In these small samples, leave-one-out cross-validation (LOOCV), 10-fold cross-validation (CV) and the .632+ bootstrap have the smallest bias for diagonal discriminant analysis, nearest neighbor and classification trees. LOOCV and 10-fold CV have the smallest bias for linear discriminant analysis. Additionally, LOOCV, 5- and 10-fold CV, and the .632+ bootstrap have the lowest mean square error. The .632+ bootstrap is quite biased in small sample sizes with strong signal-to-noise ratios. Differences in performance among resampling methods are reduced as the number of specimens available increases.
Contact: annette.molinaro@yale.edu
Supplementary Information: A complete compilation of results and R code for simulations and analyses are available in Molinaro et al. (2005) (http://linus.nci.nih.gov/brb/TechReport.htm).
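
Illustrative sketch (not part of the original record, and not the authors' R code): the abstract's point that resubstitution is seriously biased when features are selected from many candidates can be demonstrated on pure-noise data, where the true error of any classifier is 0.5. The sketch below assumes a difference-in-means feature filter and a nearest-centroid classifier, both chosen only for brevity; all function names and the simulation sizes are hypothetical. Feature selection is repeated inside every cross-validation fold, so the held-out samples never influence which features are chosen.

```python
import random
import statistics


def select_features(X, y, k):
    """Return indices of the k features with the largest absolute
    difference in class means (a simple filter, assumed for illustration)."""
    p = len(X[0])
    scores = []
    for j in range(p):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(statistics.fmean(a) - statistics.fmean(b)))
    return sorted(range(p), key=lambda j: scores[j], reverse=True)[:k]


def fit_centroids(X, y, feats):
    """Class centroids restricted to the selected features."""
    centroids = {}
    for c in (0, 1):
        rows = [row for row, label in zip(X, y) if label == c]
        centroids[c] = [statistics.fmean(r[j] for r in rows) for j in feats]
    return centroids


def predict(centroids, feats, row):
    """Assign the class whose centroid is closest in squared distance."""
    def dist(c):
        return sum((row[j] - m) ** 2 for j, m in zip(feats, centroids[c]))
    return min(centroids, key=dist)


def count_errors(X_tr, y_tr, X_te, y_te, k):
    """Select features and fit on the training part only, then count
    misclassified test samples."""
    feats = select_features(X_tr, y_tr, k)
    cents = fit_centroids(X_tr, y_tr, feats)
    return sum(predict(cents, feats, r) != t for r, t in zip(X_te, y_te))


def resubstitution_error(X, y, k):
    """Error measured on the same samples used for selection and fitting."""
    return count_errors(X, y, X, y, k) / len(y)


def kfold_cv_error(X, y, k, folds, rng):
    """k-fold CV with feature selection redone within each fold."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    wrong = 0
    for f in range(folds):
        test = set(idx[f::folds])
        train = [i for i in idx if i not in test]
        wrong += count_errors(
            [X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test], k)
    return wrong / len(y)


def simulate(seed=0, n=40, p=500, k=10, folds=10):
    """Pure-noise data: labels are independent of all features."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]  # balanced classes
    return resubstitution_error(X, y, k), kfold_cv_error(X, y, k, folds, rng)


if __name__ == "__main__":
    resub, cv = simulate()
    print(f"resubstitution: {resub:.2f}   10-fold CV: {cv:.2f}")
```

On noise data like this, the resubstitution estimate should fall well below 0.5 while the cross-validated estimate should stay near it. The sketch covers only two of the methods the abstract compares; LOOCV corresponds to setting `folds` equal to `n`, and the .632+ bootstrap is not implemented here.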
Pages: 3301-3307
Number of pages: 7