Repeated double cross validation

Cited by: 382
Authors
Filzmoser, Peter [2]
Liebmann, Bettina [1]
Varmuza, Kurt [1]
Affiliations
[1] Vienna Univ Technol, Inst Chem Engn, Lab Chemometr, A-1060 Vienna, Austria
[2] Vienna Univ Technol, Inst Stat & Probabil Theory, A-1060 Vienna, Austria
Keywords
prediction performance; optimum complexity of linear PLS models; cross validation; bootstrap; R; MULTIVARIATE CALIBRATION; GAS-CHROMATOGRAPHY; VARIABLE SELECTION; COMPONENTS; REGRESSION; PROTEOMICS; MODELS; NUMBER
DOI
10.1002/cem.1225
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Repeated double cross validation (rdCV) is a strategy for (a) optimizing the complexity of regression models and (b) for a realistic estimation of prediction errors when the model is applied to new cases (that are within the population of the data used). This strategy is suited for small data sets and is a complementary method to bootstrap methods. rdCV is a formal, partly new combination of known procedures and methods, and has been implemented in a function for the programming environment R, providing several types of plots for model evaluation. The current version of the software is dedicated to regression models obtained by partial least-squares (PLS). The applied methods for repeated splits of the data into test sets and calibration sets, as well as for estimation of the optimum number of PLS components, are described. The relevance of some parameters (number of segments in CV, number of repetitions) is investigated. rdCV is applied to two data sets from chemistry: (1) determination of glucose concentrations from near infrared (NIR) data in mash samples from bioethanol production; (2) modeling the gas chromatographic retention indices of polycyclic aromatic compounds from molecular descriptors. Models using all original variables and models using a small subset of the variables, selected by a genetic algorithm (GA), are compared by rdCV. Copyright © 2009 John Wiley & Sons, Ltd.
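The nested structure described in the abstract (repeated splits into test and calibration sets, with an inner cross validation on each calibration set to pick the number of PLS components) can be illustrated with a minimal sketch. This is not the authors' R function. It is a pure-Python illustration under stated assumptions: the function names (`pls1_fit`, `pls1_predict`, `segments`, `cv_mse`, `rdcv`), the default parameter values, and the synthetic data are all hypothetical, and the number of components is chosen simply as the one with the lowest inner-CV mean squared error, which may differ from the selection criterion used in the paper.

```python
import random
from collections import Counter


def pls1_fit(X, y, n_comp):
    """Fit a single-response PLS model (NIPALS) on mean-centered data."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]
    f = [v - y_mean for v in y]
    comps = []
    for _ in range(n_comp):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:  # no covariance left to model
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]
        tt = sum(v * v for v in t)
        if tt < 1e-12:
            break
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next component.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, p, q))
    return x_mean, y_mean, comps


def pls1_predict(model, x):
    """Predict one sample by replaying the deflation steps on it."""
    x_mean, y_mean, comps = model
    e = [xj - mj for xj, mj in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in comps:
        t = sum(ej * wj for ej, wj in zip(e, w))
        y_hat += q * t
        e = [ej - t * pj for ej, pj in zip(e, p)]
    return y_hat


def segments(indices, k, rng):
    """Randomly split the given indices into k segments."""
    idx = list(indices)
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_mse(X, y, idx, k, max_comp, rng):
    """Inner CV on a calibration set: MSE for 1..max_comp components."""
    sse = [0.0] * max_comp
    for seg in segments(idx, k, rng):
        held = set(seg)
        train = [i for i in idx if i not in held]
        Xt, yt = [X[i] for i in train], [y[i] for i in train]
        for a in range(1, max_comp + 1):
            model = pls1_fit(Xt, yt, a)
            for i in seg:
                sse[a - 1] += (y[i] - pls1_predict(model, X[i])) ** 2
    return [s / len(idx) for s in sse]


def rdcv(X, y, n_rep=5, k_outer=4, k_inner=5, max_comp=3, seed=0):
    """Repeated double CV: returns the RMSE of test-set predictions and
    a count of how often each number of components was selected."""
    rng = random.Random(seed)
    residuals, chosen = [], []
    for _ in range(n_rep):
        for test in segments(range(len(y)), k_outer, rng):
            held = set(test)
            calib = [i for i in range(len(y)) if i not in held]
            mse = cv_mse(X, y, calib, k_inner, max_comp, rng)
            a_opt = mse.index(min(mse)) + 1  # assumption: lowest inner-CV MSE
            model = pls1_fit([X[i] for i in calib], [y[i] for i in calib], a_opt)
            residuals += [y[i] - pls1_predict(model, X[i]) for i in test]
            chosen.append(a_opt)
    rmsep = (sum(r * r for r in residuals) / len(residuals)) ** 0.5
    return rmsep, Counter(chosen)


# Synthetic demonstration data (hypothetical, not from the paper).
demo_rng = random.Random(42)
X_demo = [[demo_rng.gauss(0, 1) for _ in range(5)] for _ in range(30)]
y_demo = [r[0] + 0.5 * r[1] + demo_rng.gauss(0, 0.1) for r in X_demo]
print(rdcv(X_demo, y_demo))
```

The point of the sketch is the separation of roles: test-set samples are never used when the number of components is selected, so the pooled test-set residuals give an error estimate that is not biased by the optimization, and the repetitions show how stable the selected complexity is across different random splits.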
Pages: 160-171
Page count: 12