Estimating the Optimal Number of Clusters in Categorical Data Clustering by Silhouette Coefficient

Cited by: 114
Authors
Dinh, Duy-Tai [1]
Fujinami, Tsutomu [1]
Huynh, Van-Nam [1]
Affiliations
[1] Japan Adv Inst Sci & Technol, Sch Knowledge Sci, 1-1 Asahidai, Nomi, Ishikawa 9231292, Japan
Source
KNOWLEDGE AND SYSTEMS SCIENCES, KSS 2019 | 2019 / Vol. 1103
Keywords
Data mining; Partitional clustering; Categorical data; Silhouette value; Number of clusters; ALGORITHM;
DOI
10.1007/978-981-15-1209-4_1
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Estimating the number of clusters, k, is one of the major challenges in partitional clustering. This paper proposes an algorithm named k-SCC to estimate the optimal k in categorical data clustering. In the clustering step, the algorithm uses a kernel density estimation approach to define cluster centers and an information-theoretic dissimilarity to measure the distance between centers and the objects in each cluster. A silhouette-analysis-based approach is then used to evaluate the quality of the clusterings obtained in that step and to choose the best k. Comparative experiments on both synthetic and real datasets compare the performance of k-SCC with three other algorithms. The experimental results show that k-SCC outperforms the compared algorithms in determining the number of clusters for each dataset.
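The general idea of choosing k by silhouette analysis can be illustrated with a minimal sketch. This is not the k-SCC algorithm: as stated assumptions, it substitutes a simple matching dissimilarity and mode-based centers (in the style of k-modes) for the kernel-density centers and information-theoretic dissimilarity named in the abstract, and it uses a deterministic farthest-first seeding. All function names are hypothetical.

```python
from collections import Counter

def mismatch(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def init_centers(data, k):
    """Farthest-first seeding (deterministic; an assumption of this sketch)."""
    centers = [data[0]]
    while len(centers) < k:
        centers.append(max(data, key=lambda r: min(mismatch(r, c) for c in centers)))
    return centers

def k_modes(data, k, n_iter=20):
    """Mode-based partitional clustering; returns one label per row."""
    centers = init_centers(data, k)
    labels = []
    for _ in range(n_iter):
        # Assign each row to its nearest center (ties go to the lowest index).
        labels = [min(range(k), key=lambda c: mismatch(row, centers[c])) for row in data]
        updated = []
        for c in range(k):
            members = [row for row, lab in zip(data, labels) if lab == c]
            if members:
                # New center: the most frequent category of each attribute.
                updated.append(tuple(Counter(col).most_common(1)[0][0]
                                     for col in zip(*members)))
            else:
                updated.append(centers[c])  # keep the center of an empty cluster
        if updated == centers:
            break
        centers = updated
    return labels

def silhouette(data, labels):
    """Mean silhouette value over all rows; 0.0 if fewer than two clusters."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette value of 0
        a = sum(mismatch(data[i], data[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(mismatch(data[i], data[j]) for j in idx) / len(idx)
                for other, idx in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(data)

def estimate_k(data, k_max):
    """Try k = 2..k_max and keep the first k with the highest mean silhouette."""
    best_k, best_score = None, -1.0
    for k in range(2, k_max + 1):
        score = silhouette(data, k_modes(data, k))
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

# Toy data: three groups of identical categorical rows.
toy = [("a", "x")] * 4 + [("b", "y")] * 4 + [("c", "z")] * 4
print(estimate_k(toy, 5))  # (3, 1.0)
```

On this toy input the three groups are perfectly separated, so k = 3 reaches the maximum silhouette of 1.0. Larger values of k leave the extra clusters empty and cannot improve on it, and the strict comparison keeps the smallest such k.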
Pages: 1-17
Number of pages: 17
Related papers
20 items in total
[1]  
[Anonymous], 2008, P 8 SIAM INT C DAT M, DOI 10.1137/1.9781611972788.22
[2]  
[Anonymous], 2013, IJCAI
[3]   A novel clustering algorithm based on data transformation approaches [J].
Azimi, Rasool ;
Ghayekhloo, Mohadeseh ;
Ghofrani, Mahmoud ;
Sajedi, Hedieh .
EXPERT SYSTEMS WITH APPLICATIONS, 2017, 76 :59-70
[4]  
Berkhin P, 2006, GROUPING MULTIDIMENSIONAL DATA: RECENT ADVANCES IN CLUSTERING, P25
[5]  
Dinh D.T, 2019, DATA CLUSTERING MIXE
[6]   k-CCM: A Center-Based Algorithm for Clustering Categorical Data with Missing Values [J].
Dinh, Duy-Tai ;
Huynh, Van-Nam .
MODELING DECISIONS FOR ARTIFICIAL INTELLIGENCE (MDAI 2018), 2018, 11144 :267-279
[7]   Categorical data clustering: What similarity measure to recommend? [J].
dos Santos, Tiago R. L. ;
Zarate, Luis E. .
EXPERT SYSTEMS WITH APPLICATIONS, 2015, 42 (03) :1247-1260
[8]   Extensions to the k-means algorithm for clustering large data sets with categorical values [J].
Huang, ZX .
DATA MINING AND KNOWLEDGE DISCOVERY, 1998, 2 (03) :283-304
[9]   Determining the number of clusters using information entropy for mixed data [J].
Liang, Jiye ;
Zhao, Xingwang ;
Li, Deyu ;
Cao, Fuyuan ;
Dang, Chuangyin .
PATTERN RECOGNITION, 2012, 45 (06) :2251-2265
[10]  
Lin D., 1998, Machine Learning. Proceedings of the Fifteenth International Conference (ICML'98), P296