Optimally splitting cases for training and testing high dimensional classifiers

Cited by: 252
Authors
Dobbin, Kevin K. [1 ]
Simon, Richard M. [2 ]
Affiliations
[1] Univ Georgia, Coll Publ Hlth, Dept Epidemiol & Biostat, Athens, GA 30602 USA
[2] NCI, Biometr Res Branch, NIH, Rockville, MD USA
Keywords
CLASS PREDICTION; MICROARRAY; CLASSIFICATION; SURVIVAL; CANCER
DOI
10.1186/1755-8794-4-31
Chinese Library Classification (CLC)
Q3 [Genetics]
Discipline codes
071007; 090102
Abstract
Background: We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set, and how this proportion impacts the mean squared error (MSE) of the prediction accuracy estimate. Results: We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study to better understand the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive components. Conclusions: By applying these approaches to a number of synthetic and real microarray datasets, we show that for linear classifiers the optimal proportion depends on the full dataset size (n) and the degree of differential expression between the classes (i.e., the achievable classification accuracy), with higher accuracy and smaller n resulting in more cases assigned to the training set. The commonly used strategy of allocating 2/3 of cases to training was close to optimal for reasonably sized datasets (n >= 100) with strong signals (i.e., 85% or greater full dataset accuracy). In general, we recommend our nonparametric resampling approach for determining the optimal split; it can be applied to any dataset, using any predictor development method.
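The resampling idea described in the abstract can be sketched as follows. This is an illustrative toy, not the authors' algorithm: it uses a nearest-centroid linear classifier on synthetic two-class data (the names `simulate`, `mse_for_fraction`, and the shift parameter `delta` are all hypothetical), and it approximates the "true" accuracy of the classifier built on the full sample by evaluating it on a large independent pool, which is only possible here because the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_centroid_acc(Xtr, ytr, Xte, yte):
    # Fit a nearest-centroid (linear) classifier and return test-set accuracy.
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    pred = (((Xte - c1) ** 2).sum(axis=1) < ((Xte - c0) ** 2).sum(axis=1)).astype(int)
    return float((pred == yte).mean())

def simulate(n, p=50, delta=0.5):
    # Synthetic high dimensional two-class data: first 5 of p features
    # are shifted by delta in class 1, the rest are pure noise.
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, p))
    X[:, :5] += delta * y[:, None]
    return X, y

def mse_for_fraction(train_frac, n=100, n_reps=50):
    # Resampling estimate of the MSE of the test-set accuracy estimate,
    # measured against the accuracy of the classifier trained on all n
    # cases (a large independent pool stands in for the truth).
    errs = []
    for _ in range(n_reps):
        X, y = simulate(n)
        Xpool, ypool = simulate(2000)
        truth = nearest_centroid_acc(X, y, Xpool, ypool)
        idx = rng.permutation(n)
        k = int(round(train_frac * n))
        est = nearest_centroid_acc(X[idx[:k]], y[idx[:k]], X[idx[k:]], y[idx[k:]])
        errs.append((est - truth) ** 2)
    return float(np.mean(errs))

# Compare candidate split proportions; the smallest MSE wins.
candidates = [0.4, 0.5, 2 / 3, 0.8]
best = min(candidates, key=mse_for_fraction)
```

With real data one cannot evaluate on an independent pool, which is exactly the gap the paper's non-parametric algorithm addresses; the sketch only shows the MSE-versus-split-proportion trade-off that motivates it.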
Pages: 8