A multicluster approach to selecting initial sets for clustering of categorical data

Cited by: 1
Authors
Santos-Mangudo C. [1]
Heras A.J. [1]
Affiliation
[1] Complutense University of Madrid, Madrid
Keywords
Categorical data; Clustering; K-Modes
DOI
10.28945/4643
Abstract
Aim/Purpose: This article proposes a methodology for selecting the initial sets for clustering categorical data. The main idea is to combine all the different values of every criterion or attribute to form a first proposal of so-called multiclusters, thereby obtaining the maximum number of clusters for the whole dataset. The multiclusters thus obtained are themselves clustered in a second step, according to the desired final number of clusters.
Background: Popular clustering methods for categorical data, such as the well-known K-Modes, usually select the initial sets by some random process, which introduces randomness into the final results of the algorithms. We explore a different application of the clustering methodology for categorical data that overcomes these instability problems and ultimately provides greater clustering efficiency.
Methodology: To assess the performance of the proposed algorithm and compare it with K-Modes, we apply both to categorical databases in which the response variable is known but not used in the analysis. In our examples, that response variable can be identified with the real clusters or classes to which the observations belong. With every data set, we perform a two-step analysis. In the first step, we perform the clustering analysis on data from which the response variable (the real clusters) has been omitted; in the second step, we use that omitted information to check the efficiency of the clustering algorithm by comparing the real clusters with those given by the algorithm.
Contribution: Simplicity, efficiency, and stability are the main advantages of the multicluster method.
Findings: The experimental results obtained with real databases show that the multicluster algorithm has greater precision and a better grouping effect than the classical K-Modes algorithm.
Recommendations for Practitioners: The method can be useful for researchers working with small and medium-sized datasets, allowing them to detect the underlying structure of the data in an intuitive and reasonable way.
Recommendations for Researchers: The proposed algorithm is slower than K-Modes, since it devotes much of its time to calculating the initial combinations of attributes. Reducing the computing time is therefore an important research topic.
Future Research: We are concerned with the scalability of the algorithm to large and complex data sets, as well as its application to mixed data sets with both quantitative and qualitative attributes.
© 2020 Informing Science Institute. All rights reserved.
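The two-step idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on two assumptions that the abstract does not confirm: a multicluster is taken to be the set of observations sharing one exact combination of attribute values, and the second step is taken to be a greedy merge of the two multiclusters whose mode vectors are closest in Hamming distance. The function names (`form_multiclusters`, `merge_multiclusters`, `mode_vector`, `hamming`) are hypothetical.

```python
from collections import Counter, defaultdict


def form_multiclusters(rows):
    """Step 1 (assumed reading): group row indices by their exact
    combination of attribute values. Every distinct combination is one
    multicluster, which gives the maximum number of clusters."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[tuple(row)].append(index)
    return dict(groups)


def hamming(a, b):
    """Number of attributes on which two categorical vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_vector(rows, members):
    """Most frequent value of each attribute among the member rows."""
    columns = zip(*(rows[i] for i in members))
    return tuple(Counter(col).most_common(1)[0][0] for col in columns)


def merge_multiclusters(rows, multiclusters, k):
    """Step 2 (assumed, not taken from the paper): repeatedly merge the
    two clusters with the closest mode vectors until k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [list(members) for members in multiclusters.values()]
    while len(clusters) > k:
        modes = [mode_vector(rows, c) for c in clusters]
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = hamming(modes[i], modes[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Toy data: two categorical attributes (colour, size).
rows = [
    ("red", "small"),
    ("red", "small"),
    ("red", "large"),
    ("blue", "large"),
    ("blue", "large"),
]
multiclusters = form_multiclusters(rows)   # 3 distinct value combinations
final = merge_multiclusters(rows, multiclusters, k=2)
print(final)  # [[0, 1, 2], [3, 4]]
```

Because no step draws random numbers, repeated runs on the same data give the same partition, which is the kind of stability the abstract contrasts with randomly initialised K-Modes. The pairwise search over clusters is quadratic in the number of multiclusters, so a sketch like this is only practical for small datasets.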
Pages: 227-246
Number of pages: 19