Automatic recovering the number k of clusters in the data by active query selection

Cited by: 1
Authors
Sousa, Herio [1]
de Souto, Marcilio C. P. [2]
Kuroshu, Reginaldo M. [1]
Lorena, Ana Carolina [3]
Affiliations
[1] Univ Fed Sao Paulo, Sao Jose Dos Campos, SP, Brazil
[2] Univ Orleans, Orleans, France
[3] Inst Tecnol Aeronaut, Sao Jose Dos Campos, SP, Brazil
Source
36TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, SAC 2021 | 2021
Keywords
Constrained clustering; Active query selection; Number of clusters
DOI
10.1145/3412841.3441978
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Many clustering algorithms take as a parameter the number k of clusters into which the data are partitioned. This is the case for k-means, one of the most popular clustering algorithms in the Machine Learning literature, and its variants. When clustering a dataset, however, the right number k of clusters is often not obvious, and choosing k automatically is a hard algorithmic problem. One popular procedure for estimating the number of clusters present in a dataset is to run the clustering algorithm multiple times, varying the number of clusters, and to choose one of the resulting solutions based on a given internal clustering validation measure (e.g., the silhouette coefficient). This process can be very time-consuming, as the clustering algorithm must be run several times. In this paper we present strategies that can be integrated into constrained clustering methods so as to recover the number k of clusters automatically. Constrained clustering algorithms allow one to incorporate prior information, such as whether some pairs of instances from the dataset must be placed in the same cluster or not. To improve the quality of the pairwise constraints given as input to such algorithms, some approaches use active methods for pairwise constraint selection. Our proposed strategies use the prior information provided by the pairwise constraints, together with the concept of neighborhood from active methods, not only to build a partition but also to identify automatically the number k of clusters in the data. Based on nine datasets, we show experimentally that our strategies, besides automatically recovering the number of clusters in the data, generate high-quality partitions when evaluated by indicators of clustering performance such as the adjusted Rand index.
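The abstract's idea of using pairwise constraints and neighborhoods to identify k can be illustrated with a minimal sketch. This is not the authors' method, whose details the abstract does not give. It assumes a common reading of "neighborhood" in active constrained clustering: a set of instances known to be mutually must-linked, with a cannot-link between any two different neighborhoods. The function `build_neighborhoods`, the callback `same_cluster`, and the toy labels are all hypothetical names introduced here for illustration.

```python
from typing import Callable, Hashable, List, Sequence


def build_neighborhoods(
    points: Sequence[Hashable],
    same_cluster: Callable[[Hashable, Hashable], bool],
) -> List[List[Hashable]]:
    """Group points into neighborhoods using pairwise oracle queries.

    Hypothetical helper: `same_cluster(a, b)` stands in for the oracle
    (for example a human annotator) that answers must-link (True) or
    cannot-link (False) for a pair of instances.
    """
    neighborhoods: List[List[Hashable]] = []
    for p in points:
        for hood in neighborhoods:
            # One query against a representative is enough, because
            # must-link is transitive inside a neighborhood.
            if same_cluster(p, hood[0]):
                hood.append(p)
                break
        else:
            # Cannot-link with every existing neighborhood:
            # p starts a new neighborhood.
            neighborhoods.append([p])
    return neighborhoods


# Toy usage: the oracle is simulated from hidden ground-truth labels
# (an assumption made only so the example can run).
hidden_labels = {"a": 0, "b": 1, "c": 0, "d": 2, "e": 1, "f": 2}


def oracle(x: str, y: str) -> bool:
    return hidden_labels[x] == hidden_labels[y]


neighborhoods = build_neighborhoods(list(hidden_labels), oracle)
estimated_k = len(neighborhoods)
print(estimated_k, neighborhoods)
```

Under these assumptions the number of neighborhoods serves as the estimate of k. When only a budgeted subset of instances is queried, the count is a lower bound on the true number of clusters, since a cluster with no queried instance never opens a neighborhood.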
Pages: 1021-1029
Page count: 9