Holo-Entropy Based Categorical Data Hierarchical Clustering

Times cited: 4
Authors
Sun, Haojun [1]
Chen, Rongbo [1]
Qin, Yong [2]
Wang, Shengrui [3]
Affiliations
[1] Shantou Univ, Dept Comp Sci, Shantou, Peoples R China
[2] Beijing Jiaotong Univ, State Key Lab Rail Traff Control & Safety, Beijing, Peoples R China
[3] Univ Sherbrooke, Dept Comp Sci, Sherbrooke, PQ, Canada
Funding
National Natural Science Foundation of China
Keywords
hierarchical clustering; holo-entropy; subspace; categorical data; k-modes algorithm; classification
DOI
10.15388/Informatica.2017.131
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Clustering high-dimensional data is a challenging task in data mining, and clustering high-dimensional categorical data is even more challenging because it is more difficult to measure the similarity between categorical objects. Most algorithms assume feature independence when computing similarity between data objects, or make use of computationally demanding techniques such as PCA for numerical data. Hierarchical clustering algorithms are often based on similarity measures computed on a common feature space, which is not effective when clustering high-dimensional data. Subspace clustering algorithms discover feature subspaces for clusters, but are mostly partition-based; that is, they do not produce a hierarchical structure of clusters. In this paper, we propose a hierarchical algorithm for clustering high-dimensional categorical data, based on a recently proposed information-theoretical concept named holo-entropy. The algorithm introduces new ways of exploring entropy, holo-entropy and attribute weighting in order to determine the feature subspace of a cluster and to merge clusters even though their feature subspaces differ. The algorithm is tested on UCI datasets and compared with several state-of-the-art algorithms. Experimental results show that the proposed algorithm yields higher efficiency and accuracy than the competing algorithms and allows higher reproducibility.
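
For readers unfamiliar with the term, the following is a minimal sketch of the basic quantity only, not the paper's clustering algorithm. It assumes the definition used in the earlier outlier-detection literature, where the holo-entropy of a set of categorical objects equals the sum of the Shannon entropies of its individual attributes. The function names `attribute_entropy` and `holo_entropy` and the toy clusters are hypothetical, introduced purely for illustration; the attribute weighting and subspace selection mentioned in the abstract are not modeled.

```python
from collections import Counter
from math import log2


def attribute_entropy(values):
    """Shannon entropy (in bits) of a single categorical attribute."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def holo_entropy(rows):
    """Sum of per-attribute entropies over a cluster.

    Assumption: holo-entropy is taken as the sum of attribute entropies;
    `rows` is a list of equal-length tuples of categorical values.
    """
    if not rows:
        return 0.0
    columns = zip(*rows)  # transpose: one tuple of values per attribute
    return sum(attribute_entropy(list(col)) for col in columns)


# Toy data (hypothetical): a compact cluster and a scattered one.
cluster_a = [("red", "small"), ("red", "small"), ("red", "large")]
cluster_b = [("red", "small"), ("blue", "large"), ("green", "medium")]
```

Under this assumption, a cluster whose objects agree on most attributes, such as `cluster_a`, has a lower value than a scattered one such as `cluster_b`, which is the intuition behind using the measure to judge cluster compactness.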
Pages: 303-328
Page count: 26