Greedy centroid initialization for federated K-means

Times cited: 0
Authors
Yang, Kun [1 ]
Amiri, Mohammad Mohammadi [2 ]
Kulkarni, Sanjeev R. [1 ]
Affiliations
[1] Princeton Univ, 98 Charlton St, Princeton, NJ 08540 USA
[2] Rensselaer Polytech Inst, 110 8th St, Troy, NY 12180 USA
Keywords
K-means; Clustering; Federated learning; Machine learning; Security; Privacy
DOI
10.1007/s10115-024-02066-x
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We study learning from unlabeled data distributed across clients in a federated fashion, where raw data do not leave the corresponding devices. We develop a K-means clustering algorithm within this federated setting: the local datasets are clustered at the clients, and a server generates the global clusters after aggregating the local ones. Given the importance of initialization to the federated K-means algorithm (FKM), our objective is to find better initial centroids by utilizing the local data stored on each client. To this end, we start the centroid initialization at the clients, rather than at the server, since the server initially lacks any preliminary insight into the clients' data. The clients first select their local initial clusters and subsequently share their clustering information (including cluster centroids and sizes) with the server. The server then employs a greedy algorithm to determine the global initial centroids based on the information received from the clients. We refer to this idea as G-FKM. Numerical results obtained from both synthetic and public datasets show that the proposed algorithm achieves accelerated convergence, exhibiting a reduced within-cluster sum of squares (WCSS) and a higher adjusted Rand index than three distinct federated K-means variants. This improvement comes at the relatively low cost of sending limited additional information from the clients to the server, rather than conducting the initialization entirely at the server. Furthermore, we have also observed that the proposed algorithm outperforms the centralized algorithm in cases where the data distribution across the clients is highly skewed.
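The server-side step described in the abstract, greedily selecting global initial centroids from the clients' reported local centroids and cluster sizes, can be sketched as follows. The abstract does not specify the exact greedy criterion, so the size-weighted farthest-point rule and the function name `greedy_global_init` below are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def greedy_global_init(local_centroids, local_sizes, k):
    """Greedily pick k global initial centroids from the pooled local
    centroids and cluster sizes reported by all clients.

    NOTE: the paper's greedy criterion is not spelled out in the
    abstract; this sketch assumes a size-weighted farthest-point
    heuristic, which is one plausible instantiation.
    """
    centroids = np.asarray(local_centroids, dtype=float)  # shape (m, d)
    sizes = np.asarray(local_sizes, dtype=float)          # shape (m,)

    # Seed with the local centroid backed by the most data points.
    chosen = [int(np.argmax(sizes))]
    while len(chosen) < k:
        # Squared distance from each candidate to its nearest chosen centroid.
        diffs = centroids[:, None, :] - centroids[chosen][None, :, :]
        d2 = np.min((diffs ** 2).sum(axis=-1), axis=1)
        # Greedily take the candidate maximizing size-weighted distance,
        # excluding centroids that were already selected.
        scores = sizes * d2
        scores[chosen] = -np.inf
        chosen.append(int(np.argmax(scores)))
    return centroids[chosen]
```

Under this rule, large local clusters that are far from already-selected centroids are preferred, which favors well-separated, well-supported starting points before Lloyd-style refinement begins.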
Pages: 3393-3425
Page count: 33