Nonparametric density estimation by exact leave-p-out cross-validation

Cited by: 43
Authors
Celisse, Alain [1]
Robin, Stephane [1]
Affiliations
[1] Agro Paris Tech, INRA, MIA, UMR 518 AgroParisTech, F-75231 Paris, France
Keywords
cross-validation; delete-p cross-validation; density estimation; histogram; kernel; leave-p-out; multiple testing; quadratic risk; V-fold cross-validation
DOI
10.1016/j.csda.2007.10.002
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
The problem of density estimation is addressed by minimization of the L2-risk for both histogram and kernel estimators. This quadratic risk is estimated by leave-p-out cross-validation (LPO), which is made possible thanks to closed formulas, contrary to common belief. The potential gain in the use of LPO with respect to V-fold cross-validation (V-fold) in terms of the bias-variance trade-off is highlighted. An exact quantification of the extra variability induced by the preliminary random partition of the data in the V-fold is proposed. Furthermore, exact expressions are derived for both the bias and the variance of the risk estimator with histograms. Plug-in estimates of these quantities are provided, while their accuracy is assessed thanks to concentration inequalities. An adaptive selection procedure for p in the case of histograms is subsequently presented. This relies on minimization of the mean square error of the LPO risk estimator. Finally, a simulation study is carried out which first illustrates the higher reliability of the LPO with respect to the V-fold, and then assesses the behavior of the selection procedure. For instance, optimality of leave-one-out (LOO) is shown, at least empirically, in the context of regular histograms. © 2007 Elsevier B.V. All rights reserved.
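To make the "closed formulas" claim concrete, here is a minimal Python sketch for the histogram case. It is not the authors' code. The closed-form expression is worked out here from the hypergeometric moments of the training-set bin counts, assuming the LPO criterion averages, over all splits, the integral of the squared training histogram minus twice its mean value at the p held-out points. The function names (`bin_counts`, `lpo_risk_closed_form`, `lpo_risk_brute_force`) and the sample data are hypothetical, and every observation is assumed to fall inside the bin range.

```python
import math
from bisect import bisect_right
from itertools import combinations


def bin_counts(sample, edges):
    """Count points per bin [edges[k], edges[k+1]); points outside are ignored."""
    counts = [0] * (len(edges) - 1)
    for x in sample:
        k = bisect_right(edges, x) - 1
        if 0 <= k < len(counts):
            counts[k] += 1
    return counts


def lpo_risk_closed_form(sample, edges, p):
    """LPO estimate of the quadratic risk of a histogram, without enumerating splits.

    Assumed form (derived for this sketch, with q_k = n_k / n and w_k the bin width):
      (2n - p) / ((n-1)(n-p)) * sum_k q_k / w_k
      - n (n - p + 1) / ((n-1)(n-p)) * sum_k q_k^2 / w_k
    """
    n = len(sample)
    widths = [b - a for a, b in zip(edges, edges[1:])]
    freqs = [c / n for c in bin_counts(sample, edges)]
    s1 = sum(q / w for q, w in zip(freqs, widths))
    s2 = sum(q * q / w for q, w in zip(freqs, widths))
    denom = (n - 1) * (n - p)
    return (2 * n - p) / denom * s1 - n * (n - p + 1) / denom * s2


def lpo_risk_brute_force(sample, edges, p):
    """Same criterion, averaged explicitly over all C(n, p) held-out subsets."""
    n = len(sample)
    m = n - p  # training-set size
    widths = [b - a for a, b in zip(edges, edges[1:])]
    total, n_splits = 0.0, 0
    for test_idx in combinations(range(n), p):
        held_out = set(test_idx)
        train = [x for i, x in enumerate(sample) if i not in held_out]
        test = [sample[i] for i in test_idx]
        # Histogram heights fitted on the training points only.
        heights = [c / (m * w) for c, w in zip(bin_counts(train, edges), widths)]
        integral_sq = sum(h * h * w for h, w in zip(heights, widths))
        # Mean value of the training histogram at the held-out points.
        fit_on_test = sum(c * h for c, h in zip(bin_counts(test, edges), heights)) / p
        total += integral_sq - 2.0 * fit_on_test
        n_splits += 1
    return total / n_splits


# Illustrative data (made up): 8 points in [0, 1), 4 regular bins.
sample = [0.05, 0.12, 0.31, 0.33, 0.48, 0.52, 0.77, 0.91]
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
for p in (1, 2, 3):
    closed = lpo_risk_closed_form(sample, edges, p)
    brute = lpo_risk_brute_force(sample, edges, p)
    print(p, closed, brute, math.isclose(closed, brute, rel_tol=1e-9))
```

The closed form costs one pass over the bin counts, whereas the enumeration grows like C(n, p), which is why the brute-force version is only usable as a check on very small samples.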
Pages: 2350-2368
Page count: 19