CPCQ: Contrast pattern based clustering quality index for categorical data

Cited by: 8
Authors
Liu, Qingbao [1]
Dong, Guozhu [2]
Affiliations
[1] Natl Univ Def Technol, Coll Informat Syst & Management, Changsha 410073, Hunan, Peoples R China
[2] Wright State Univ, Dept Comp Sci & Engn, Dayton, OH 45435 USA
Keywords
Clustering validation; Contrast pattern; Clustering quality index; VALIDATION
DOI
10.1016/j.patcog.2011.10.007
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Clustering validation is concerned with assessing the quality of clustering solutions. Since clustering is unsupervised and highly explorative, clustering validation has been an important and long-standing research problem. Existing validity measures, including entropy-based and distance-based indices, have significant shortcomings. Indeed, for many datasets from the UCI repository, they fail to recognize that the expert-determined classes are the best clusters, and they frequently prefer clusterings with a larger number of clusters. This weakness reflects their inability to accurately capture intra-cluster coherence and inter-cluster separation. This paper proposes a novel Contrast Pattern based Clustering Quality index (CPCQ) for categorical data, which uses the quality and diversity of the contrast patterns that contrast the clusters in a given clustering. High-quality contrast patterns can serve to characterize the clusters and discriminate one cluster from the others. The CPCQ index is based on the rationale that a high-quality clustering should have many diversified high-quality contrast patterns among its clusters. The quality of an individual contrast pattern is defined in terms of its length, its support, and the length of its corresponding closed pattern. The quality measure concerning "many diversified" contrast patterns is defined in terms of the quality and diversity of some selected groups of contrast patterns with minimal overlap among contrast patterns and groups in terms of items and matching transactions. Experiments show that the CPCQ index (1) does not require a user to provide a distance function; (2) does not give inappropriate preference to a larger number of clusters; and (3) can recognize that expert-determined classes are the best clusters for many datasets from the UCI repository. © 2011 Elsevier Ltd. All rights reserved.
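To make the notion of a contrast pattern concrete, the following is a minimal Python sketch, not the CPCQ index itself. The abstract does not give the formal definitions, so the sketch assumes a simple stand-in: a pattern (a set of attribute=value items) counts as a contrast pattern for a cluster when its support in that cluster reaches a threshold and it matches no transaction in any other cluster. The names `support`, `contrast_patterns`, `max_len`, and `min_support`, and the toy data, are hypothetical.

```python
from itertools import combinations


def support(pattern, transactions):
    """Fraction of transactions that contain every item of the pattern."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if pattern <= t)
    return hits / len(transactions)


def contrast_patterns(clusters, max_len=2, min_support=0.5):
    """Map each cluster index to a list of (pattern, support) pairs.

    Assumption for illustration only: a pattern qualifies when its support
    in the home cluster is at least min_support and it matches no
    transaction in any other cluster.
    """
    result = {}
    for i, home in enumerate(clusters):
        # All transactions that belong to the other clusters.
        others = [t for j, c in enumerate(clusters) if j != i for t in c]
        items = sorted(set().union(*home)) if home else []
        found = []
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                pattern = frozenset(combo)
                s = support(pattern, home)
                if s >= min_support and support(pattern, others) == 0.0:
                    found.append((pattern, s))
        result[i] = found
    return result


# Toy categorical data: each transaction is a set of attribute=value items.
clusters = [
    [
        frozenset({"color=red", "shape=round"}),
        frozenset({"color=red", "shape=round"}),
        frozenset({"color=red", "shape=square"}),
    ],
    [
        frozenset({"color=blue", "shape=square"}),
        frozenset({"color=blue", "shape=round"}),
    ],
]

patterns = contrast_patterns(clusters)
for index, found in patterns.items():
    for pattern, s in found:
        print(index, sorted(pattern), round(s, 2))
```

In this toy data, `color=red` separates the first cluster from the second, while `shape=round` does not because it appears in both. A clustering whose clusters each admit several such patterns covering different items and transactions is the kind the abstract describes as having "many diversified high-quality contrast patterns".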
Pages: 1739-1748
Number of pages: 10