Automatic recovering the number k of clusters in the data by active query selection

Cited by: 1

Authors:

Sousa, Herio [1]

de Souto, Marcilio C. P. [2]

Kuroshu, Reginaldo M. [1]

Lorena, Ana Carolina [3]

Affiliations:

[1] Univ Fed Sao Paulo, Sao Jose Dos Campos, SP, Brazil

[2] Univ Orleans, Orleans, France

[3] Inst Tecnol Aeronaut, Sao Jose Dos Campos, SP, Brazil

Source:

36TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, SAC 2021 | 2021

Keywords:

Constrained clustering; Active query selection; Number of clusters

DOI:

10.1145/3412841.3441978

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

One common parameter of many clustering algorithms is the number k of clusters into which the data are partitioned. This is the case for k-means, one of the most popular clustering algorithms in the machine learning literature, and its variants. In practice, the right number k of clusters for a dataset is often not obvious, and choosing k automatically is a hard algorithmic problem. One popular procedure for estimating the number of clusters in a dataset is to run the clustering algorithm multiple times with different numbers of clusters and then choose one of the resulting solutions according to an internal clustering validation measure (e.g., the silhouette coefficient). This process can be very time consuming, since the clustering algorithm must be run several times. In this paper we present strategies that can be integrated into constrained clustering methods to recover the number k of clusters automatically. Constrained clustering algorithms allow one to incorporate prior information, such as whether certain pairs of instances must be placed in the same cluster or in different clusters. To improve the quality of the pairwise constraints given as input to such algorithms, some approaches use active methods for pairwise constraint selection. Our proposed strategies use the prior information provided by the pairwise constraints, together with the concept of neighborhood from active methods, not only to build a partition but also to identify the number k of clusters in the data automatically. Using nine datasets, we show experimentally that our strategies, besides automatically recovering the number of clusters, generate high-quality partitions as measured by clustering performance indicators such as the adjusted Rand index.
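
To make the neighborhood idea mentioned in the abstract more concrete, the following is a minimal Python sketch. It is not the authors' strategy, whose details this record does not give. It only illustrates the general notion that, if pairwise queries tell us whether two instances belong together, instances can be grouped into neighborhoods and the number of neighborhoods can serve as an estimate of k. The function name `build_neighborhoods`, the `must_link` oracle, and the toy labels are all hypothetical, and the oracle is assumed to be noise-free and transitive.

```python
from typing import Callable, Hashable, List, Sequence


def build_neighborhoods(
    points: Sequence[Hashable],
    must_link: Callable[[Hashable, Hashable], bool],
) -> List[List[Hashable]]:
    """Group points into neighborhoods using pairwise oracle queries.

    Points inside one neighborhood are known to share a cluster; points in
    different neighborhoods are known to be cannot-linked.
    """
    neighborhoods: List[List[Hashable]] = []
    for p in points:
        for hood in neighborhoods:
            # One query against a representative suffices, because the
            # must-link relation is assumed to be transitive.
            if must_link(p, hood[0]):
                hood.append(p)
                break
        else:
            # p is cannot-linked to every existing neighborhood,
            # so it starts a new one.
            neighborhoods.append([p])
    return neighborhoods


# Hypothetical oracle backed by hidden ground-truth labels (toy data).
labels = {"a": 0, "b": 0, "c": 1, "d": 2, "e": 1, "f": 2}


def oracle(x: str, y: str) -> bool:
    return labels[x] == labels[y]


hoods = build_neighborhoods(list(labels), oracle)
k_estimate = len(hoods)  # number of neighborhoods as an estimate of k
print(k_estimate, hoods)
```

In this toy run every instance is queried, so the estimate matches the number of hidden labels. With a limited query budget, as in active constraint selection, the count would only be a lower bound on the number of clusters actually present.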

Pages: 1021-1029

Page count: 9
