Reducing over-optimism in variable selection by cross-model validation

Cited by: 147
Authors
Anderssen, Endre [1]
Dyrstad, Knut
Westad, Frank
Martens, Harald
Affiliations
[1] Norwegian Univ Sci & Technol, Dept Chem, N-7491 Trondheim, Norway
[2] GE Healthcare, Oslo, Norway
[3] Matforsk, N-1430 As, Norway
[4] Univ Life Sci, Ctr Integrated Genom, CIGENE, As, Norway
Keywords
variable selection; regression; over-fitting; cross-model validation; jack-knifing; QSAR
DOI
10.1016/j.chemolab.2006.04.021
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Extensive optimisation of a mathematical model's fit to a relatively small set of empirical data may lead to over-optimistic validation results. If the assessment of the final, optimised model is based on the same validation method and the same input data that were used as the basis for the extensive model optimisation, accumulated spurious correlations may appear as real predictive ability in the final model validation. An example of this is the use of extensive variable selection in multiple regression, based on a cross-model validation scheme. To illustrate the over-optimism problem in optimisation based on conventional one-layered validation, an artificial data set containing only random numbers was submitted to regression modelling. The model was optimised by stepwise variable selection. A very good apparent predictive ability for y from X was found in the final model by leave-one-out cross-validation (84%), after the number of X-variables had been reduced stepwise from 500 to 29. Finally, the performance of cross-model validation was tested on one large QSAR data set. Several calibration sets were chosen randomly, and for each a regression model was optimised by variable selection. The prediction accuracy of these models was compared to the cross-validation and cross-model validation results. In these tests, cross-model validation gave the better measure of model predictive ability. © 2006 Published by Elsevier B.V.
Pages: 69-74
Number of pages: 6
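Illustrative sketch (not part of the original record): the random-number experiment described in the abstract can be imitated in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' procedure. The abstract does not spell out the stepwise selection or the cross-model validation algorithm, so a simple correlation filter (`select_top_k`, a hypothetical helper) stands in for the variable selection, and the two-layered scheme is reduced to repeating that selection inside every leave-one-out split. The sample size (40), the number of kept variables (5), and the seed are arbitrary choices, so the 84% figure from the abstract is not expected to be reproduced.

```python
import numpy as np


def select_top_k(X, y, k):
    """Indices of the k columns of X with the largest |correlation| with y.

    Assumption: this filter replaces the stepwise selection of the paper.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:k]


def fit_predict(X_train, y_train, X_test):
    """Ordinary least squares with an intercept; returns predictions for X_test."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef


def q2(y, y_pred):
    """Cross-validated explained variance: 1 - PRESS / total sum of squares."""
    return 1.0 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)


def loo_one_layer(X, y, k):
    """Select variables once on ALL samples, then leave-one-out cross-validate."""
    cols = select_top_k(X, y, k)
    preds = np.empty(len(y))
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        preds[i] = fit_predict(X[train][:, cols], y[train], X[i:i + 1][:, cols])[0]
    return q2(y, preds)


def loo_two_layer(X, y, k):
    """Redo the variable selection inside every leave-one-out split."""
    preds = np.empty(len(y))
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        cols = select_top_k(X[train], y[train], k)  # left-out sample never seen
        preds[i] = fit_predict(X[train][:, cols], y[train], X[i:i + 1][:, cols])[0]
    return q2(y, preds)


# Pure noise: y has no real relationship to any column of X.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 500))
y = rng.standard_normal(40)

q2_naive = loo_one_layer(X, y, k=5)
q2_nested = loo_two_layer(X, y, k=5)
print(f"one-layer Q2: {q2_naive:.2f}")
print(f"two-layer Q2: {q2_nested:.2f}")
```

Because the one-layer variant lets every sample influence which variables are kept, its leave-one-out score tends to look favourable even on noise, whereas the two-layer variant keeps the left-out sample out of the selection step and should report little or no predictive ability.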