A new internal clustering validation index for categorical data based on concentration of attribute values

Cited by: 0
Authors
Fu L.-W. [1]
Wu S. [1]
Affiliations
[1] Donlinks School of Economics and Management, University of Science and Technology Beijing, Beijing
Source
Gongcheng Kexue Xuebao/Chinese Journal of Engineering | 2019 / Vol. 41 / No. 05
Keywords
Categorical data; Cluster analysis; Dissimilarity; High dimensional data; Internal clustering validation index; Similarity
DOI
10.13374/j.issn2095-9389.2019.05.015
Abstract
Clustering is a core task of data mining whose purpose is to identify natural structures in a dataset. The results of cluster analysis depend not only on the nature of the data itself but also on a priori conditions such as the clustering algorithm, the similarity/dissimilarity measure, and the parameters. For data without a clustering structure, clustering results need to be evaluated. For data with a clustering structure, the different results obtained under different algorithms and parameters also need to be further optimized through clustering validation. Moreover, clustering validation is vital to clustering applications, especially when external information is not available. It is applied in algorithm selection, parameter determination, and determination of the number of clusters. Most traditional internal clustering validation indices for numerical data fail to handle categorical data. Categorical data is a common data type whose attribute values are discrete and cannot be ordered. For categorical data, the existing measures have limitations in different application circumstances. In this paper, a new similarity measure, called CONC, is defined based on the concentration ratio of every attribute value; it evaluates the similarity of objects within a cluster. Similarly, a new dissimilarity measure, called DCRP, is defined based on the discrepancy of characteristic attribute values; it evaluates the dissimilarity between two clusters. A new internal clustering validation index, called CVC, is then proposed based on CONC and DCRP.
Compared with other indices, CVC has three characteristics: (1) it evaluates the compactness of a cluster based on information from the whole dataset, not only from that cluster; (2) it evaluates the separation between two clusters using several characteristic attribute values, so that clustering information is not lost and the negative effects of noise are eliminated; (3) it evaluates compactness and separation without being influenced by the number of objects. Furthermore, UCI benchmark datasets were used to compare the proposed index with other internal clustering validation indices (CU, CDCS, and IE). An external index (NMI) was used to evaluate the effectiveness of these internal indices. According to the experimental results, CVC is more effective than the other internal clustering validation indices. In addition, CVC, as an internal index, is more widely applicable than the external NMI index because it can evaluate clustering results without external information. © All rights reserved.
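
The abstract uses NMI (normalized mutual information) as the external reference against which the internal indices are judged. As a minimal sketch of how such a score can be computed from ground-truth labels and a clustering result, the pure-Python example below may help. The function names are hypothetical, and the normalization by the arithmetic mean of the two entropies is an assumption: several NMI normalizations are in use, and the abstract does not say which one the paper adopts.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same objects."""
    n = len(a)
    count_a = Counter(a)
    count_b = Counter(b)
    joint = Counter(zip(a, b))
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with counts
        mi += (c / n) * log(c * n / (count_a[x] * count_b[y]))
    return mi


def nmi(true_labels, cluster_labels):
    """NMI normalized by the arithmetic mean of the two entropies.

    Assumption: the paper may use a different normalization
    (for example the geometric mean or the maximum).
    """
    if len(true_labels) != len(cluster_labels) or not true_labels:
        raise ValueError("label sequences must be non-empty and equal in length")
    h_true = entropy(true_labels)
    h_pred = entropy(cluster_labels)
    if h_true == 0.0 and h_pred == 0.0:
        # Both labelings put every object in one group: treat as identical.
        return 1.0
    return mutual_information(true_labels, cluster_labels) / ((h_true + h_pred) / 2)


if __name__ == "__main__":
    truth = ["a", "a", "b", "b"]
    print(nmi(truth, [1, 1, 0, 0]))  # same partition, different label names
    print(nmi(truth, [0, 1, 0, 1]))  # partition independent of the truth
```

Because NMI needs the ground-truth labels, it can only be computed on benchmark data. This is the limitation the abstract points to when it argues that an internal index such as CVC is more widely applicable.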
Pages: 682-693
Page count: 11