Min max kurtosis distance based improved initial centroid selection approach of K-means clustering for big data mining on gene expression data

Cited by: 3
Authors
Pandey, Kamlesh Kumar [1]
Shukla, Diwakar [1, 2]
Affiliations
[1] Dr Hari Singh Gour Vishwavidyalaya, Dept Comp Sci & Applicat, Sagar, MP, India
[2] Dr Hari Singh Gour Vishwavidyalaya, Dept Math & Stat, Sagar, India
Keywords
Big data clustering; Initial centroid algorithm; Genome clustering; Gene expression data clustering; Kurtosis clustering; Systematic sampling; Sorting heuristic; Convergence speed; K-means; ALGORITHM; ROBUST
DOI
10.1007/s12530-022-09447-z
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Genome clustering is a big data application that identifies the prognosis of severe diseases and biological processes across enormous sets of genes. The K-Means (KM) algorithm is the most commonly used clustering algorithm for gene expression data; it extracts hidden knowledge, patterns and trends from gene expression profiles to support decision-making. Unfortunately, the KM algorithm is extremely sensitive to initial centroid selection, because the initial centroids influence computational effectiveness, efficiency, cost and the risk of converging to local optima. Existing centroid initialization algorithms incur high computational complexity on high-dimensional data because of extensive iterations, distance computations, and comparisons of data and results. To overcome these weaknesses, this study proposes the Min-Max Kurtosis Distance (MMKD) algorithm for big data clustering in a single-machine environment. The MMKD algorithm addresses the weaknesses of KM clustering by using the distance of the data points from the origin together with the minimum- and maximum-kurtosis dimensions. The proposed algorithm is compared with the KM, KM++, ADV, MKM, Mean-KM, NFD, K-MAM, NRKM2, FMNN and MuKM algorithms on sixteen gene expression datasets, using internal and external effectiveness validation metrics together with efficiency measurements. The experimental evaluation shows that the MMKDKM algorithm (KM initialized with MMKD) reduces the number of iterations, the incidence of local optima and the computation cost, and improves cluster performance, effectiveness and efficiency, with more stable convergence than the other algorithms. The statistical analysis indicates that the difference between the proposed MMKDKM algorithm and the compared algorithms is significant.
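The abstract names the ingredients of the initialization (per-dimension kurtosis, distance from the origin and, per the keywords, a sorting heuristic with systematic sampling) but not the exact procedure. The Python sketch below is one hypothetical way these pieces could fit together; it is not the published MMKD algorithm, and every function name in it is an illustrative assumption.

```python
import math


def kurtosis(values):
    """Population (Pearson) kurtosis: fourth central moment / variance squared."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) if m2 > 0 else 0.0


def kurtosis_distance_init(points, k):
    """Deterministically pick k initial centroids (assumed procedure, not MMKD itself)."""
    dims = range(len(points[0]))
    kurt = [kurtosis([p[d] for p in points]) for d in dims]
    d_min = min(dims, key=lambda d: kurt[d])  # minimum-kurtosis dimension
    d_max = max(dims, key=lambda d: kurt[d])  # maximum-kurtosis dimension
    # Assumption: distance from the origin is measured on these two dimensions only.
    ranked = sorted(points, key=lambda p: math.hypot(p[d_min], p[d_max]))
    # Assumption: systematic sampling takes the middle point of each of
    # k equal-sized segments of the sorted list.
    step = len(ranked) / k
    return [list(ranked[int((i + 0.5) * step)]) for i in range(k)]


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd iterations; returns (centroids, labels, passes performed)."""
    labels = None
    for it in range(1, max_iter + 1):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            return centroids, labels, it
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels, max_iter
```

Because the seeding is deterministic, repeated runs on the same data start from the same centroids, which is the property that makes iteration counts comparable across runs. Note that `kmeans` updates the `centroids` list it is given in place.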
Pages: 207-244
Page count: 38