A note on split selection bias in classification trees

Cited by: 39
Author
Shih, YS [1]
Affiliation
[1] Natl Chung Cheng Univ, Dept Stat Sci, Chiayi 62117, Taiwan
Keywords
Cramér's V² statistic; Kolmogorov-Smirnov statistic; P-value; Pearson chi-square statistic
DOI
10.1016/S0167-9473(03)00064-1
Chinese Library Classification
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
A common approach to split selection in classification trees is to search through all possible splits generated by the predictor variables. A splitting criterion is then used to evaluate those splits, and the one with the largest criterion value is usually chosen to channel samples into the corresponding subnodes. However, this greedy method is biased in variable selection when the numbers of available split points differ across variables. Such a result may thus hamper the intuitively appealing nature of classification trees. The problem of split selection bias for two-class tasks with numerical predictors is examined. A statistical explanation of its existence is given, and a solution based on P-values is provided for the case where the Pearson chi-square statistic is used as the splitting criterion. (C) 2003 Elsevier B.V. All rights reserved.
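The bias described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's procedure; it is a minimal illustration under assumed settings. Two numerical predictors are generated independently of a two-class label, one with a single possible split point and one with many. The variable whose best split has the larger Pearson chi-square value is then selected, and the selection is repeated over many trials. The function names (`chi_square_2x2`, `best_split_statistic`, `selection_frequency`) and the sample sizes are hypothetical choices made for this example.

```python
import random


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A zero margin means the table carries no information.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def best_split_statistic(x, y):
    """Largest chi-square value over all splits of the form x <= c.

    x is a numerical predictor, y holds class labels coded 0/1.
    Candidate cut points are the distinct values of x except the largest.
    """
    pairs = sorted(zip(x, y))
    n = len(pairs)
    total_ones = sum(y)
    best = 0.0
    left_n = 0
    left_ones = 0
    for i, (xi, yi) in enumerate(pairs):
        left_n += 1
        left_ones += yi
        # Evaluate a split only at a boundary between distinct x values.
        if i + 1 < n and pairs[i + 1][0] != xi:
            a = left_ones                      # left node, class 1
            b = left_n - left_ones             # left node, class 0
            c = total_ones - left_ones         # right node, class 1
            d = (n - left_n) - c               # right node, class 0
            best = max(best, chi_square_2x2(a, b, c, d))
    return best


def selection_frequency(n_samples=100, n_trials=500, seed=0):
    """Fraction of trials in which the many-valued predictor is selected.

    Both predictors are independent of the label, so an unbiased rule
    would pick each about half of the time.
    """
    rng = random.Random(seed)
    wins_many = 0
    for _ in range(n_trials):
        y = [rng.randint(0, 1) for _ in range(n_samples)]
        x_few = [rng.randint(0, 1) for _ in range(n_samples)]   # one split point
        x_many = [rng.random() for _ in range(n_samples)]       # many split points
        if best_split_statistic(x_many, y) > best_split_statistic(x_few, y):
            wins_many += 1
    return wins_many / n_trials


if __name__ == "__main__":
    print(selection_frequency())
```

Under these assumptions the many-valued predictor should be selected far more often than half of the time, even though neither predictor is related to the label. Comparing P-values of the maximized statistic instead of the raw maxima, as the abstract proposes, is intended to remove this advantage.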
Pages: 457-466
Number of pages: 10