Cluster validity functions for categorical data: a solution-space perspective

Times cited: 13
Authors
Bai, Liang [1, 2]
Liang, Jiye [1 ]
Affiliations
[1] Shanxi Univ, Sch Comp & Informat Technol, Key Lab Computat Intelligence & Chinese Informat, Minist Educ, Taiyuan 030006, Shanxi, Peoples R China
[2] Chinese Acad Sci, Inst Comp Technol, Key Lab Network Data Sci & Technol, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Cluster analysis; Cluster validity function; Generalization; Effectiveness; Normalization; K-MODES ALGORITHM; IMPACT;
DOI
10.1007/s10618-014-0387-5
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
For categorical data, there are three widely used internal validity functions: the k-modes objective function, the category utility function and the information entropy function, all of which are defined on within-cluster information only. Many clustering algorithms have been developed that use them as objective functions and search for their optimal solutions. In this paper, we study the generalization, effectiveness and normalization of the three validity functions from a solution-space perspective. First, we present a generalized validity function for categorical data. Based on it, we analyze the generality and differences of the three validity functions in the solution space. Furthermore, we address the question of whether between-cluster information is ignored when these validity functions are used to evaluate clustering results. Finally, we analyze the upper and lower bounds of the three validity functions for a given data set, which can help estimate the clustering difficulty of a data set and compare the performance of a clustering algorithm across different data sets.
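As a rough illustration of the three within-cluster validity functions named in the abstract, the Python sketch below evaluates them on a toy categorical data set. It uses common textbook forms, which are assumptions here and may differ from the paper's exact definitions and normalizations: the k-modes objective as the total number of attribute mismatches to each cluster's mode, category utility averaged over the number of clusters, and the size-weighted sum of per-attribute entropies (base 2). The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2

def _groups(data, labels):
    """Group rows by cluster label."""
    clusters = {}
    for row, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(row)
    return list(clusters.values())

def kmodes_cost(data, labels):
    """Total simple-matching distance of objects to their cluster mode (lower is better)."""
    cost = 0
    for rows in _groups(data, labels):
        for col in zip(*rows):
            # The mode is the most frequent value, so mismatches = size - max count.
            cost += len(col) - max(Counter(col).values())
    return cost

def category_utility(data, labels):
    """Category utility, averaged over clusters (assumed form; higher is better)."""
    n = len(data)
    groups = _groups(data, labels)
    # Sum of squared value probabilities over the whole data set.
    base = sum((c / n) ** 2 for col in zip(*data) for c in Counter(col).values())
    total = 0.0
    for rows in groups:
        m = len(rows)
        within = sum((c / m) ** 2 for col in zip(*rows) for c in Counter(col).values())
        total += (m / n) * (within - base)
    return total / len(groups)

def expected_entropy(data, labels):
    """Size-weighted sum of per-attribute entropies within clusters (lower is better)."""
    n = len(data)
    total = 0.0
    for rows in _groups(data, labels):
        m = len(rows)
        h = -sum((c / m) * log2(c / m) for col in zip(*rows) for c in Counter(col).values())
        total += (m / n) * h
    return total

# Toy data: two attributes, two natural groups.
data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
good = [0, 0, 1, 1]  # separates the two groups
bad = [0, 1, 0, 1]   # mixes them

for name, labels in (("good", good), ("bad", bad)):
    print(name, kmodes_cost(data, labels),
          category_utility(data, labels), expected_entropy(data, labels))
```

On this toy example all three functions prefer the partition that separates the two groups, even though each is computed from within-cluster counts alone.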
Pages: 1560-1597
Number of pages: 38