K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data

Times cited: 616
Authors
Ikotun, Abiodun M. [1]
Ezugwu, Absalom E. [1,2]
Abualigah, Laith [3,4,5,6,7]
Abuhaija, Belal [8]
Heming, Jia [9]
Affiliations
[1] Univ KwaZulu Natal, Sch Comp Sci, Pietermaritzburg, KwaZulu-Natal, South Africa
[2] North West Univ, Unit Data Sci & Comp, 1 Hoffman St Potchefstroom, ZA-2520 Potchefstroom, South Africa
[3] Al Al Bayt Univ, Prince Hussein Bin Abdullah Coll Informat Technol, Mafraq 130040, Jordan
[4] Al Ahliyya Amman Univ, Hourani Ctr Appl Sci Res, Amman 19328, Jordan
[5] Middle East Univ, Fac Informat Technol, Amman 11831, Jordan
[6] Appl Sci Private Univ, Fac Informat Technol, Amman 11931, Jordan
[7] Univ Sains Malaysia, Sch Comp Sci, George Town 11800, Malaysia
[8] Wenzhou Kean Univ, Dept Comp Sci, Wenzhou, Peoples R China
[9] Sanming Univ, Coll Informat & Engn, Sanming, Peoples R China
Funding
European Union Horizon 2020;
Keywords
K-means; K-means variants; Clustering algorithm; Modified k-means; Improved k-means; Perspectives on big data clustering; Big data clustering; GENETIC ALGORITHM; FEATURE-REDUCTION; MEANS-PLUS; INITIALIZATION; SEARCH; SELECTION; ENTROPY; VERSION;
DOI
10.1016/j.ins.2022.11.139
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Recent advances in techniques for scientific data collection in the era of big data allow for the systematic accumulation of large quantities of data at various data-capturing sites. Similarly, exponential growth in the development of different data analysis approaches has been reported in the literature, amongst which the K-means algorithm remains the most popular and straightforward clustering algorithm. The broad applicability of the algorithm in many clustering application areas can be attributed to its implementation simplicity and low computational complexity. However, the K-means algorithm has many challenges that negatively affect its clustering performance. In the algorithm's initialization process, users must specify the number of clusters in a given dataset a priori, while the initial cluster centers are randomly selected. Furthermore, the algorithm's performance is sensitive to the selection of these initial cluster centers, and for large datasets, determining the optimal number of clusters to start with is a very challenging task. Moreover, because of the algorithm's greedy nature, the random selection of the initial cluster centers sometimes results in convergence to a local minimum. A further limitation is that the similarity between data objects is determined from their features using the Euclidean distance metric, which limits the algorithm's robustness in detecting other cluster shapes and poses a great challenge in detecting overlapping clusters. Many research efforts to improve the K-means algorithm's performance and robustness have been conducted and reported in the literature. The current work presents an overview and taxonomy of the K-means clustering algorithm and its variants. The history of K-means, current trends, open issues and challenges, and recommended future research perspectives are also discussed. (c) 2022 Elsevier Inc. All rights reserved.
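The behaviour summarized in the abstract (a user-supplied number of clusters, randomly chosen initial centers, Euclidean distance as the similarity measure, and greedy convergence) can be illustrated with a minimal sketch of the standard Lloyd-style K-means iteration. This is not code from the reviewed paper; the function names `kmeans` and `squared_euclidean`, the `seed` and `max_iter` parameters, and the toy data are all assumptions made for illustration.

```python
import random


def squared_euclidean(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Illustrative Lloyd-style K-means (hypothetical helper, not the paper's code).

    points: list of equal-length numeric tuples
    k: number of clusters, which the user must supply up front
    Returns (centers, sse), where sse is the sum of squared distances
    from each point to its nearest center.
    """
    rng = random.Random(seed)
    # Random initialization: pick k distinct data points as starting centers.
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_euclidean(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                mean = tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                new_centers.append(mean)
            else:
                # Keep the old center if a cluster ends up empty.
                new_centers.append(centers[j])
        if new_centers == centers:
            break  # Greedy procedure has converged, possibly to a local minimum.
        centers = new_centers
    sse = sum(min(squared_euclidean(p, c) for c in centers) for p in points)
    return centers, sse


# Toy data (an assumption for this sketch): two well-separated groups of four points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]

# Different seeds give different random starts; keeping the lowest-SSE run
# is one common way to reduce sensitivity to initialization.
runs = [kmeans(points, k=2, seed=s) for s in range(5)]
best_centers, best_sse = min(runs, key=lambda run: run[1])
```

Because each run only ever moves to a neighbouring, lower-cost configuration, the result depends on the starting centers. Restarting from several seeds, as in the last lines, is a simple mitigation and not one of the specific variants surveyed in the paper.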
Pages: 178 - 210
Number of pages: 33