Chance correlation in variable subset regression: Influence of the objective function, the selection mechanism, and ensemble averaging

Cited by: 63
Author
Baumann, K [1]
Affiliation
[1] Univ Wurzburg, Dept Pharm, D-97074 Wurzburg, Germany
Source
QSAR & COMBINATORIAL SCIENCE | 2005 / Vol. 24 / Issue 09
Keywords
variable selection; The LASSO; ensemble averaging; cross-validation; permutation test; bagging; chance correlation; overfitting;
DOI
10.1002/qsar.200530134
Chinese Library Classification (CLC) number
R914 [Medicinal chemistry];
Discipline classification code
100701;
Abstract
Cross-validation is often used to guide variable selection algorithms. While cross-validation estimates the prediction error almost without bias when no model selection (such as variable selection) is involved, it is heavily biased when a large amount of model selection is applied (i.e., sifting through thousands of models). In the latter case, internal figures of merit such as R²_CV or RMSEP_CV can be deceptively overoptimistic. The extent of this inflation (overoptimism) and the factors that influence it are studied here. It turns out that the inflation is extremely large for small data sets. The main factors influencing the degree of inflation are the data set size, the size of the variable pool, the allowed object-to-variable ratio, the objective function used to guide a stepwise selection technique, and the correlation structure of the data matrix. Moreover, changing the selection mechanism from the commonly applied stepwise procedures to the more stable shrinkage and selection technique LASSO largely eliminates the inflation. No inflation is observed when ensemble averaging is used to estimate the prediction error. This property, combined with the potential of ensemble averaging to improve predictivity and the possibility of using the information from the single models of the ensemble for validation tasks, renders ensemble averaging an attractive tool if prediction is the primary goal of the analysis.
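The inflation described in the abstract can be illustrated with a minimal sketch, which is not the paper's experimental protocol. It assumes pure-noise data (response and descriptors are independent), a greedy forward selection guided by leave-one-out R²_CV, and ordinary least squares as the model. The function names, data sizes, and seed are hypothetical choices made for illustration only.

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out cross-validated R^2 for OLS with intercept (hat-matrix shortcut)."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)                      # hat (projection) matrix
    loo_resid = (y - H @ y) / (1.0 - np.diag(H))   # exact LOO residuals for OLS
    press = np.sum(loo_resid ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, max_vars):
    """Greedy forward selection that maximizes the LOO R^2 at every step."""
    selected, best = [], -np.inf
    for _ in range(max_vars):
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        scores = [loo_q2(X[:, selected + [j]], y) for j in candidates]
        k = int(np.argmax(scores))
        if scores[k] <= best:                      # stop when no variable improves the objective
            break
        selected.append(candidates[k])
        best = scores[k]
    return selected, best

def run_noise_experiment(seed=0, n_train=20, n_vars=500, max_vars=5, n_test=1000):
    """Select variables on pure noise, then score the chosen model on fresh noise."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_train, n_vars))
    y = rng.standard_normal(n_train)               # response is unrelated to X by construction
    selected, q2_cv = forward_select(X, y, max_vars)

    # Fit the selected model on the training objects.
    A = np.column_stack([np.ones(n_train), X[:, selected]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    # Independent objects drawn from the same null distribution.
    X_new = rng.standard_normal((n_test, n_vars))
    y_new = rng.standard_normal(n_test)
    pred = coef[0] + X_new[:, selected] @ coef[1:]
    r2_test = 1.0 - np.sum((y_new - pred) ** 2) / np.sum((y_new - y_new.mean()) ** 2)
    return selected, q2_cv, r2_test

selected, q2_cv, r2_test = run_noise_experiment()
print(f"selected variables: {selected}")
print(f"internal R2_CV: {q2_cv:.2f}   external R2: {r2_test:.2f}")
```

Because the response carries no signal, any clearly positive internal R²_CV in this sketch is chance correlation introduced by the search itself, while the external R² stays near or below zero. On real data, a comparable check would be to rerun the whole selection on a permuted response.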
Pages: 1033-1046
Number of pages: 14