External validation measures for K-means clustering: A data distribution perspective

Cited by: 41
Authors
Wu, Junjie [1]
Chen, Jian [2]
Xiong, Hui [3]
Xie, Ming [2]
Affiliations
[1] Beihang Univ, Sch Econ & Management, Beijing 100083, Peoples R China
[2] Tsinghua Univ, Key Res Inst Humanities & Social Sci Univ, Res Ctr Contemporary Management, Beijing 100084, Peoples R China
[3] Rutgers State Univ, Management Sci & Informat Syst Dept, Newark, NJ 07102 USA
Funding
National Science Foundation (USA);
Keywords
Cluster validation; K-means; External criteria; Normalization;
DOI
10.1016/j.eswa.2008.06.093
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cluster validation is an important part of any cluster analysis. External measures such as entropy, purity and mutual information are often used to evaluate K-means clustering. However, whether these measures are indeed suitable for K-means clustering remains unknown. Along this line, in this paper, we show that a data distribution view is of great use in selecting the right measures for K-means clustering. Specifically, we first introduce the data distribution view of K-means, and the resultant uniform effect on highly imbalanced data sets. Eight external measures widely used in recent data mining tasks are also collected as candidates for K-means evaluation. Then, we demonstrate that only three measures, namely the variation of information (VI), the van Dongen criterion (VD) and the Mirkin metric (M), can detect the negative uniform effect of K-means in the clustering results. We also provide new normalization schemes for these three measures, i.e., VI'(norm), VD'(norm) and M'(norm), which enable cross-data comparisons of clustering quality. Finally, we explore some properties such as the consistency and sensitivity of the three measures, and give some advice on how to use them in K-means practice. (C) 2008 Elsevier Ltd. All rights reserved.
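For readers who want to try the three measures named in the abstract, the following is a minimal sketch in Python. It assumes the commonly cited unnormalized definitions computed from a cluster-by-class contingency table: VI as 2H(joint) - H(clusters) - H(classes), VD as 2n minus the sums of row and column maxima, and the Mirkin metric as the sum of squared row and column totals minus twice the sum of squared cells. These forms are an assumption here, not taken from this record; the paper's own normalized variants VI'(norm), VD'(norm) and M'(norm) are not reproduced. The function names are hypothetical.

```python
import math


def contingency(labels_true, labels_pred):
    """Build a cluster-by-class count table (rows: clusters, columns: classes)."""
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    row = {k: i for i, k in enumerate(clusters)}
    col = {c: j for j, c in enumerate(classes)}
    table = [[0] * len(classes) for _ in clusters]
    for t, p in zip(labels_true, labels_pred):
        table[row[p]][col[t]] += 1
    return table


def _entropy(counts, n):
    """Shannon entropy (natural log) of a list of counts summing to n."""
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def variation_of_information(table):
    """Assumed form: VI = 2 * H(joint) - H(clusters) - H(classes)."""
    n = sum(sum(r) for r in table)
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    joint = [v for r in table for v in r]
    return 2 * _entropy(joint, n) - _entropy(rows, n) - _entropy(cols, n)


def van_dongen(table):
    """Assumed unnormalized form: 2n - sum of row maxima - sum of column maxima."""
    n = sum(sum(r) for r in table)
    row_max = sum(max(r) for r in table)
    col_max = sum(max(c) for c in zip(*table))
    return 2 * n - row_max - col_max


def mirkin(table):
    """Assumed unnormalized form: sum(row^2) + sum(col^2) - 2 * sum(cell^2)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    cells = sum(v * v for r in table for v in r)
    return sum(r * r for r in rows) + sum(c * c for c in cols) - 2 * cells


if __name__ == "__main__":
    # Imbalanced ground truth (4 vs 2) split into more even clusters (2 vs 4),
    # a toy stand-in for the uniform effect described in the abstract.
    truth = [0, 0, 0, 0, 1, 1]
    pred = [0, 0, 1, 1, 1, 1]
    t = contingency(truth, pred)
    print("VI:", variation_of_information(t))
    print("VD:", van_dongen(t))
    print("Mirkin:", mirkin(t))
```

All three functions return 0 when the clustering matches the class labels up to a relabeling, and grow as the partitions diverge, so lower values indicate better agreement under these assumed definitions.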
Pages: 6050-6061
Page count: 12