A clustering-based discretization for supervised learning

被引:40
作者
Gupta, Ankit [2 ]
Mehrotra, Kishan G. [1 ]
Mohan, Chilukuri [1 ]
机构
[1] Syracuse Univ, Dept Elect Engn & Comp Sci, Ctr Sci & Technol 4 106, Syracuse, NY 13244 USA
[2] Indian Inst Technol, Dept Elect Engn, Kanpur 208016, Uttar Pradesh, India
关键词
Discretization; Clustering; Binning; Supervised learning;
D O I
10.1016/j.spl.2010.01.015
中图分类号
O21 [概率论与数理统计]; C8 [统计学];
学科分类号
020208 ; 070103 ; 0714 ;
摘要
We address the problem of discretization of continuous variables for machine learning classification algorithms. Existing procedures do not use interdependence between the variables towards this goal. Our proposed method uses clustering to exploit such interdependence. Numerical results show that this improves the classification performance in almost all cases. Even if an existing algorithm can successfully operate with continuous variables, better performance is obtained if the variables are first discretized. An additional advantage of discretization is that it reduces the overall computation time. (C) 2010 Elsevier B.V. All rights reserved.
引用
收藏
页码:816 / 824
页数:9
相关论文
共 23 条
[1]  
Agrawal R., 1993, SIGMOD Record, V22, P207, DOI 10.1145/170036.170072
[2]  
[Anonymous], 2014, C4. 5: programs for machine learning
[3]  
[Anonymous], 1993, Proceedings of the 13th International Joint Conference on Artificial Intelligence
[4]  
BAY SD, 2000, INT C KNOWL DISC DAT, P315
[5]  
BURGER AL, 1996, COMPUTATIONAL LINGUI, P39
[6]  
CATLETT J, 1991, LECT NOTES ARTIF INT, V482, P164, DOI 10.1007/BFb0017012
[7]   Global discretization of continuous attributes as preprocessing for machine learning [J].
Chmielewski, MR ;
GrzymalaBusse, JW .
INTERNATIONAL JOURNAL OF APPROXIMATE REASONING, 1996, 15 (04) :319-331
[8]   CLUSTER SEPARATION MEASURE [J].
DAVIES, DL ;
BOULDIN, DW .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 1979, 1 (02) :224-227
[9]  
Dougherty J., 1995, MACH LEARN, V10, P57
[10]  
Ertoz L., 2002, P 2 SIAM INT C DAT M, V8