Random Ordinality Ensembles: Ensemble methods for multi-valued categorical data

Cited by: 8
Authors
Ahmad, Amir [1]
Brown, Gavin [2]
Affiliations
[1] King Abdulaziz Univ, Fac Comp & Informat Technol, Rabigh, Saudi Arabia
[2] Univ Manchester, Sch Comp Sci, Manchester M13 9PL, Lancs, England
Keywords
Classifier ensemble; Decision tree; Categorical data; Multi-way split; Binary split; LEARNING ALGORITHM; CLASSIFIERS; DISTANCE
DOI
10.1016/j.ins.2014.10.064
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Data with multi-valued categorical attributes can cause major problems for decision trees. The high branching factor can lead to data fragmentation, where decisions have little or no statistical support. In this paper, we propose a new ensemble method, Random Ordinality Ensembles (ROE), that reduces this problem and provides significantly improved accuracies over current ensemble methods. We perform a random projection of the categorical data into a continuous space. As the transformation to continuous data is a random process, each dataset has a different imposed ordinality. A decision tree that learns on this new continuous space is able to use binary splits, and hence reduces the data fragmentation problem. Generally, these binary trees are accurate. Diverse training datasets ensure diverse decision trees in the ensemble. We created two variants of the ROE technique. In the first variant, we used decision trees as the base models for ensembles. In the second variant, we combined the attribute randomisation of Random Subspaces with Random Ordinality. These methods match or outperform other popular ensemble methods. Different properties of these ensembles were studied. The study suggests that random ordinality trees are generally more accurate and smaller than multi-way split decision trees. It is also shown that random ordinality attributes can be used to improve the Bagging and AdaBoost.M1 ensemble methods. (C) 2014 Elsevier Inc. All rights reserved.
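The idea described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes that the random transformation is a random permutation of each attribute's values mapped to integer ranks, that the base learner is a simple greedy Gini tree, and that predictions are combined by majority vote. All function names (`random_ordinal_maps`, `encode`, `build_tree`, `fit_roe`, `predict_roe`) and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_ordinal_maps(rows, rng):
    """Draw one random value -> rank mapping per categorical column."""
    maps = []
    for column in zip(*rows):
        values = sorted(set(column))
        rng.shuffle(values)  # the imposed order is arbitrary by design
        maps.append({value: rank for rank, value in enumerate(values)})
    return maps


def encode(row, maps):
    """Replace each value by its rank; unseen values get -1 (an arbitrary choice)."""
    return [m.get(value, -1) for m, value in zip(maps, row)]


def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def build_tree(X, y, depth=0, max_depth=8):
    """Greedy binary-split tree on numeric features, scored by Gini impurity."""
    if depth == max_depth or len(set(y)) == 1:
        return {"label": Counter(y).most_common(1)[0][0]}
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X})[:-1]:  # candidate thresholds
            left = [i for i, x in enumerate(X) if x[j] <= t]
            right = [i for i, x in enumerate(X) if x[j] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:  # no feature can be split any further
        return {"label": Counter(y).most_common(1)[0][0]}
    _, j, t, left, right = best
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree([X[i] for i in left], [y[i] for i in left],
                           depth + 1, max_depth),
        "right": build_tree([X[i] for i in right], [y[i] for i in right],
                            depth + 1, max_depth),
    }


def predict_tree(node, x):
    while "label" not in node:
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


def fit_roe(rows, labels, n_trees=25, seed=0):
    """Train each tree on the same data under a different random ordinality."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_trees):
        maps = random_ordinal_maps(rows, rng)
        X = [encode(row, maps) for row in rows]
        ensemble.append((maps, build_tree(X, labels)))
    return ensemble


def predict_roe(ensemble, row):
    """Majority vote over the trees, each using its own encoding."""
    votes = Counter(predict_tree(tree, encode(row, maps))
                    for maps, tree in ensemble)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): a six-valued attribute plus a three-valued one.
colors = ["red", "orange", "yellow", "green", "blue", "violet"]
rows = [(c, s) for c in colors for s in ("S", "M", "L")]
labels = ["warm" if c in ("red", "orange", "yellow") else "cool"
          for c, _ in rows]
ensemble = fit_roe(rows, labels, n_trees=15, seed=0)
predictions = [predict_roe(ensemble, row) for row in rows]
```

Each tree sees only binary threshold splits on the ranked values, so a six-valued attribute never produces a six-way branch. The diversity among trees comes solely from the different random orderings in this sketch; the second variant mentioned in the abstract would additionally sample a random subset of attributes per tree.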
Pages: 75-94
Page count: 20