MGR: An information theory based hierarchical divisive clustering algorithm for categorical data

Cited by: 21
Authors
Qin, Hongwu [1 ,2 ]
Ma, Xiuqin [1 ,2 ]
Herawan, Tutut [3 ]
Zain, Jasni Mohamad [1 ]
Affiliations
[1] Univ Malaysia Pahang, Fac Comp Syst & Software Engn, Gambang 26300, Kuantan, Malaysia
[2] Northwest Normal Univ, Coll Comp Sci & Engn, Lanzhou 730070, Gansu, Peoples R China
[3] Univ Malaya, Fac Comp Sci & Informat Technol, Kuala Lumpur 50603, Malaysia
Keywords
Data mining; Clustering; Categorical data; Gain ratio; Information theory; K-MODES ALGORITHM; DATA SETS;
DOI
10.1016/j.knosys.2014.03.013
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Categorical data clustering has attracted much attention recently due to the fact that much of the data contained in today's databases is categorical in nature. While many algorithms for clustering categorical data have been proposed, some have low clustering accuracy while others have high computational complexity. This research proposes mean gain ratio (MGR), a new information theory based hierarchical divisive clustering algorithm for categorical data. MGR implements clustering from the attributes viewpoint which includes selecting a clustering attribute using mean gain ratio and selecting an equivalence class on the clustering attribute using entropy of clusters. It can be run with or without specifying the number of clusters while few existing clustering algorithms for categorical data can be run without specifying the number of clusters. Experimental results on nine University of California at Irvine (UCI) benchmark and ten synthetic data sets show that MGR performs better as compared to baseline algorithms in terms of its performance and efficiency of clustering. (C) 2014 Elsevier B.V. All rights reserved.
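The two selection steps named in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything below is an assumption: the gain ratio of attribute i with respect to attribute j is taken as the C4.5-style ratio (H(a_j) - H(a_j | a_i)) / H(a_i), the mean is taken over all other attributes, and the entropy of a cluster is taken as the sum of its per-attribute entropies. All function names are hypothetical and this is not the authors' implementation.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(target, given):
    """H(target | given): entropy of target within each group of given, weighted by group size."""
    n = len(target)
    groups = {}
    for t, g in zip(target, given):
        groups.setdefault(g, []).append(t)
    return sum(len(ts) / n * entropy(ts) for ts in groups.values())


def gain_ratio(rows, i, j):
    """Assumed definition: information gain of attribute i about attribute j, divided by H(a_i)."""
    col_i = [r[i] for r in rows]
    col_j = [r[j] for r in rows]
    split_info = entropy(col_i)
    if split_info == 0:  # attribute i takes a single value, so it cannot split the data
        return 0.0
    return (entropy(col_j) - conditional_entropy(col_j, col_i)) / split_info


def mean_gain_ratio(rows, i):
    """Average gain ratio of attribute i with respect to every other attribute."""
    others = [j for j in range(len(rows[0])) if j != i]
    if not others:  # a single-attribute table has nothing to average over
        return 0.0
    return sum(gain_ratio(rows, i, j) for j in others) / len(others)


def select_clustering_attribute(rows):
    """Index of the attribute with the highest mean gain ratio (first one wins ties)."""
    return max(range(len(rows[0])), key=lambda i: mean_gain_ratio(rows, i))


def select_equivalence_class(rows, attr):
    """Return (value, members) for the equivalence class on attr with the lowest assumed cluster entropy."""
    classes = {}
    for r in rows:
        classes.setdefault(r[attr], []).append(r)

    def cluster_entropy(members):
        # Assumption: cluster entropy is the sum of per-attribute entropies inside the cluster.
        return sum(entropy([m[k] for m in members]) for k in range(len(members[0])))

    return min(classes.items(), key=lambda kv: cluster_entropy(kv[1]))
```

For example, with `rows = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "p"), ("b", "y", "q")]`, the first two attributes determine each other and the third is independent of both, so `select_clustering_attribute(rows)` picks one of the first two. A divisive procedure would then split off the class returned by `select_equivalence_class` and repeat on the remaining objects.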
Pages: 401-411
Page count: 11
Related papers
40 in total
[1]  
Abdu E., 2009, P 2 WORKSH DAT MIN U, P393
[2]  
Andritsos P, 2004, LECT NOTES COMPUT SC, V2992, P123
[3]  
[Anonymous], 2014, C4.5: programs for machine learning
[4]  
[Anonymous], 1999, P 5 ACM SIGKDD INT C
[5]   A cluster centers initialization method for clustering categorical data [J].
Bai, Liang ;
Liang, Jiye ;
Dang, Chuangyin ;
Cao, Fuyuan .
EXPERT SYSTEMS WITH APPLICATIONS, 2012, 39 (09) :8022-8029
[6]   An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data [J].
Bai, Liang ;
Liang, Jiye ;
Dang, Chuangyin .
KNOWLEDGE-BASED SYSTEMS, 2011, 24 (06) :785-795
[7]  
Barbara D., 2002, Proceedings of the Eleventh International Conference on Information and Knowledge Management. CIKM 2002, P582, DOI 10.1145/584792.584888
[8]   A dissimilarity measure for the k-Modes clustering algorithm [J].
Cao, Fuyuan ;
Liang, Jiye ;
Li, Deyu ;
Bai, Liang ;
Dang, Chuangyin .
KNOWLEDGE-BASED SYSTEMS, 2012, 26 :120-127
[9]   A new initialization method for categorical data clustering [J].
Cao, Fuyuan ;
Liang, Jiye ;
Bai, Liang .
EXPERT SYSTEMS WITH APPLICATIONS, 2009, 36 (07) :10223-10228
[10]  
Chen K., 2005, P 17 INT C SCI STAT, P253