Soft subspace clustering of categorical data with probabilistic distance

Cited by: 40
Authors
Chen, Lifei [1, 2]
Wang, Shengrui [3]
Wang, Kaijun [1, 2]
Zhu, Jianping [4, 5]
Affiliations
[1] Fujian Normal Univ, Sch Math & Comp Sci, Fuzhou 350117, Fujian, Peoples R China
[2] Fujian Normal Univ, Fujian Prov Key Lab Network Secur & Cryptol, Fuzhou 350117, Fujian, Peoples R China
[3] Univ Sherbrooke, Dept Comp Sci, Sherbrooke, PQ J1K 2R1, Canada
[4] Xiamen Univ, Sch Management, Xiamen 361005, Peoples R China
[5] Xiamen Univ, Data Mining Res Ctr, Xiamen 361005, Peoples R China
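Funding
National Natural Science Foundation of China; Natural Sciences and Engineering Research Council of Canada
Keywords
Subspace clustering; Categorical data; Distance measure; Attribute weighting; Kernel density estimation; Algorithm
DOI
10.1016/j.patcog.2015.09.027
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract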
Categorical data clustering is an important subject in pattern recognition. Currently, subspace clustering of categorical data remains an open problem due to the difficulties in estimating attribute interestingness according to the statistics of categories in clusters. In this paper, a new algorithm is proposed for clustering categorical data with a novel soft feature-selection scheme, by which each categorical attribute is automatically assigned a weight that correlates with the smoothed dispersion of the categories in a cluster. In the proposed algorithm, dissimilarity between categorical data objects is measured using a probabilistic distance function, based on kernel density estimation for categorical attributes. We also make use of the probabilistic distances to define a cluster validity index for estimating the number of categorical clusters. The suitability of the proposal is demonstrated in an empirical study done with some widely used real-world data sets and synthetic data sets, and the results show its outstanding performance. (C) 2015 Elsevier Ltd. All rights reserved.
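The abstract names kernel density estimation for categorical attributes and a probabilistic distance, but it does not give the formulas. The following is a minimal sketch of the general idea only, not the paper's method: it assumes the standard Aitchison-Aitken kernel for smoothing category frequencies and, as a further illustrative assumption, measures the distance of a value to a cluster as the squared difference between its indicator vector and the smoothed probabilities. The function names `smoothed_probs` and `indicator_distance` and the bandwidth parameter `lam` are hypothetical.

```python
from collections import Counter


def smoothed_probs(values, categories, lam):
    """Smoothed category probabilities for one attribute within one cluster.

    Assumption: Aitchison-Aitken kernel, which keeps weight 1 - lam on the
    observed category and spreads lam evenly over the other c - 1 categories.
    lam = 0 reproduces the raw relative frequencies.
    """
    c = len(categories)
    if c < 2:
        raise ValueError("need at least two categories")
    if not 0.0 <= lam <= (c - 1) / c:
        raise ValueError("lam must lie in [0, (c - 1) / c]")
    if not values:
        raise ValueError("need at least one observed value")

    n = len(values)
    counts = Counter(values)
    probs = {}
    for o in categories:
        f = counts[o] / n  # raw relative frequency of category o
        probs[o] = (1.0 - lam) * f + lam / (c - 1) * (1.0 - f)
    return probs


def indicator_distance(x, probs):
    """Squared distance between the indicator vector of x and the
    smoothed probability vector (illustrative choice, not from the paper)."""
    return sum(((1.0 if x == o else 0.0) - p) ** 2 for o, p in probs.items())


# Hypothetical cluster: one attribute with categories a, b, c.
cluster_values = ["a", "a", "a", "b"]
probs = smoothed_probs(cluster_values, ["a", "b", "c"], lam=0.3)
for value in ["a", "b", "c"]:
    print(value, round(indicator_distance(value, probs), 4))
```

With smoothing, a category that never occurs in the cluster (here `c`) still receives a nonzero probability, so its distance to the cluster is large but finite and comparable across attributes.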
Pages: 322-332
Page count: 11