Hierarchical clustering of mixed data based on distance hierarchy

Cited by: 78
Authors
Hsu, Chung-Chian [1]
Chen, Chin-Long [1]
Su, Yu-Wei [1]
Affiliation
[1] Natl Yunlin Univ Sci & Technol, Dept Informat Management, Touliu 640, Yunlin, Taiwan
Keywords
categorical data; distance hierarchy; hierarchical clustering; k-means; mixed data
DOI
10.1016/j.ins.2007.05.003
Chinese Library Classification number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Data clustering is an important data mining technique that partitions data according to some similarity criterion. Many algorithms have been proposed for clustering numerical data, and some recent research tackles the problem of clustering categorical or mixed data. Unlike the subtraction scheme used for numerical attributes, there is no standard for measuring the distance between categorical values. In this article, we propose a distance representation scheme, the distance hierarchy, which facilitates expressing the similarity between categorical values and also unifies distance measurement for numerical and categorical values. We then apply the scheme to mixed data clustering, in particular by integrating it with a hierarchical clustering algorithm. Consequently, the integrated approach handles numerical and categorical data uniformly and also allows the similarity between categorical values to be taken into account. Experimental results show that the proposed approach produces better clustering results than conventional clustering algorithms when categorical attributes are present and their values have different degrees of similarity. © 2007 Elsevier Inc. All rights reserved.
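The abstract does not spell out how a distance hierarchy is built, so the following is only a minimal sketch of the general idea under stated assumptions: categorical values sit in a weighted concept tree, and the distance between two values is the length of the path joining them through their lowest common ancestor. The tree `PARENT`, its link weights, the normalisation constant `MAX_CAT`, the Euclidean combination in `mixed_distance`, and the single-linkage merge rule in `agglomerate` are all hypothetical choices made for illustration, not details taken from the paper.

```python
from itertools import combinations

# Hypothetical concept tree: child -> (parent, link weight). "Any" is the root.
PARENT = {
    "Drink": ("Any", 1.0),
    "Snack": ("Any", 1.0),
    "Coke": ("Drink", 1.0),
    "Pepsi": ("Drink", 1.0),
    "Chips": ("Snack", 1.0),
}
MAX_CAT = 4.0  # longest leaf-to-leaf path in this toy tree (assumption)


def path_to_root(node):
    """Map each ancestor of node (and node itself) to its distance from node."""
    out = {node: 0.0}
    total = 0.0
    while node in PARENT:
        parent, weight = PARENT[node]
        total += weight
        out[parent] = total
        node = parent
    return out


def categorical_distance(a, b):
    """Path length between a and b through their lowest common ancestor."""
    up_a, up_b = path_to_root(a), path_to_root(b)
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


def mixed_distance(x, y, num_range):
    """Distance between (category, number) records, each part scaled to [0, 1]."""
    d_cat = categorical_distance(x[0], y[0]) / MAX_CAT
    d_num = abs(x[1] - y[1]) / num_range
    return (d_cat ** 2 + d_num ** 2) ** 0.5


def agglomerate(points, k, num_range):
    """Naive single-linkage agglomerative clustering down to k clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Pick the pair of clusters whose closest members are nearest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                mixed_distance(p, q, num_range)
                for p in clusters[ij[0]]
                for q in clusters[ij[1]]
            ),
        )
        clusters[i] += clusters.pop(j)  # i < j, so index i stays valid
    return clusters


records = [("Coke", 1.0), ("Pepsi", 1.2), ("Chips", 1.1), ("Chips", 1.0)]
result = agglomerate(records, k=2, num_range=10.0)
```

In this toy tree, "Coke" and "Pepsi" share the parent "Drink" and so end up closer to each other than either is to "Chips". A plain match/mismatch measure would treat all three pairs as equally distant, which is the limitation the abstract points to.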
Pages: 4474-4492
Number of pages: 19