The design and analysis of benchmark experiments

Cited by: 123
Authors
Hothorn, T
Leisch, F
Zeileis, A
Hornik, K
Affiliations
[1] Univ Erlangen Nurnberg, Inst Med Informat Biometrie & Epidemiol, D-91054 Erlangen, Germany
[2] Vienna Univ Technol, Inst Stat & Wahrscheinlichkeitstheorie, A-1040 Vienna, Austria
[3] Vienna Univ Econ & Business Adm, Inst Stat & Math, A-1090 Vienna, Austria
Funding
Austrian Science Fund
Keywords
bootstrap; cross-validation; hypothesis testing; model comparison; performance;
DOI
10.1198/106186005X59630
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract
The assessment of the performance of learners by means of benchmark experiments is an established exercise. In practice, benchmark studies are a tool to compare the performance of several competing algorithms for a certain learning problem. Cross-validation or resampling techniques are commonly used to derive point estimates of the performances, which are compared to identify algorithms with good properties. For several benchmarking problems, test procedures taking the variability of those point estimates into account have been suggested. Most of the recently proposed inference procedures are based on special variance estimators for the cross-validated performance. We introduce a theoretical framework for inference problems in benchmark experiments and show that standard statistical test procedures can be used to test for differences in the performances. The theory is based on well-defined distributions of performance measures which can be compared with established tests. To demonstrate the usefulness in practice, the theoretical results are applied to regression and classification benchmark studies based on artificial and real-world data.
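The workflow the abstract describes can be sketched as follows: draw B learning samples by bootstrap, fit two competing algorithms on each, evaluate them on the out-of-bootstrap observations, and compare the resulting performance distributions with a standard paired test. This is a minimal illustrative sketch, not the paper's implementation; the data-generating process, the two algorithms (a constant predictor vs. a least-squares line), and B = 100 are assumptions chosen for brevity.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, B = 200, 100
x = rng.uniform(-2, 2, n)
y = 1.5 * x + rng.normal(0, 1, n)           # illustrative linear DGP

def fit_mean(xt, yt):                        # algorithm A: constant predictor
    m = yt.mean()
    return lambda xs: np.full_like(xs, m)

def fit_linear(xt, yt):                      # algorithm B: least-squares line
    b, a = np.polyfit(xt, yt, 1)
    return lambda xs: b * xs + a

perf = np.empty((B, 2))                      # mean squared error per draw
for b in range(B):
    idx = rng.integers(0, n, n)              # bootstrap learning sample
    oob = np.setdiff1d(np.arange(n), idx)    # out-of-bootstrap test sample
    for j, fit in enumerate((fit_mean, fit_linear)):
        pred = fit(x[idx], y[idx])(x[oob])
        perf[b, j] = np.mean((y[oob] - pred) ** 2)

# standard paired test for a difference in performance distributions
t, p = stats.ttest_rel(perf[:, 0], perf[:, 1])
```

Because the performance estimates form well-defined (empirical) distributions over the B learning samples, off-the-shelf tests such as the paired t-test above, or a Wilcoxon signed-rank test if normality is doubtful, can be applied directly.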
Pages: 675-699 (25 pages)