Min max kurtosis distance based improved initial centroid selection approach of K-means clustering for big data mining on gene expression data

Cited by: 3
Authors
Pandey, Kamlesh Kumar [1]
Shukla, Diwakar [1, 2]
Affiliations
[1] Dr Hari Singh Gour Vishwavidyalaya, Dept Comp Sci & Applicat, Sagar, MP, India
[2] Dr Hari Singh Gour Vishwavidyalaya, Dept Math & Stat, Sagar, India
Keywords
Big data clustering; Initial centroid algorithm; Genome clustering; Gene expression data clustering; Kurtosis clustering; Systematic sampling; Sorting heuristic; Convergence speed; K-means; ALGORITHM; ROBUST;
DOI
10.1007/s12530-022-09447-z
CLC number
TP18 [Artificial intelligence theory]
Discipline classification code
081104; 0812; 0835; 1405
Abstract
Genome clustering is a big data application that helps identify the prognosis of severe diseases and biological processes across enormous sets of genes. The K-Means (KM) algorithm is the most commonly used clustering algorithm for gene expression data; it extracts hidden knowledge, patterns and trends from gene expression profiles to support decision-making. Unfortunately, the KM algorithm is extremely sensitive to initial centroid selection, since the initial cluster centroids influence computational effectiveness, efficiency, cost and susceptibility to local optima. Existing centroid initialization algorithms incur high computational complexity on high-dimensional data because of extensive iterations, distance computations, and data and result comparisons. To overcome these weaknesses, this study proposes the Min-Max Kurtosis Distance (MMKD) algorithm for big data clustering in a single-machine environment. The MMKD algorithm addresses the weaknesses of KM clustering by using the distance between the data points of origin and the minimum-maximum kurtosis dimension. The proposed algorithm is compared with the KM, KM++, ADV, MKM, Mean-KM, NFD, K-MAM, NRKM2, FMNN and MuKM algorithms using internal and external effectiveness validation metrics, together with efficiency measurements, on sixteen gene expression datasets. The experimental evaluation demonstrates that the MMKDKM algorithm reduces iterations, local optima and computation costs, and improves cluster performance, effectiveness and efficiency, with more stable convergence than the other algorithms. The statistical analysis indicates that the proposed MMKDKM algorithm achieves a statistically significant difference.
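The abstract does not spell out the MMKD steps, so the following is only an illustrative sketch of one way a kurtosis-and-sorting seeding heuristic could look, not the authors' algorithm. It assumes that the dimensions with the smallest and largest kurtosis are selected, that points are ranked by their distance from the origin within those two dimensions, and that the sorted data are cut into K equal partitions whose means serve as initial centroids. The function names `kurtosis` and `kurtosis_sorted_seeds` are hypothetical.

```python
import math
from typing import List, Sequence


def kurtosis(values: Sequence[float]) -> float:
    """Population (Pearson) kurtosis m4 / m2^2; returns 0.0 for a constant column."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) if m2 > 0 else 0.0


def kurtosis_sorted_seeds(data: List[List[float]], k: int) -> List[List[float]]:
    """Hypothetical seeding heuristic (an assumption, not the published MMKD steps)."""
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and the number of points")

    # Kurtosis of every dimension (column).
    n_dims = len(data[0])
    kurt = [kurtosis([row[d] for row in data]) for d in range(n_dims)]
    d_min = kurt.index(min(kurt))  # dimension with the smallest kurtosis
    d_max = kurt.index(max(kurt))  # dimension with the largest kurtosis

    # Assumed ranking key: distance from the origin using only the two
    # selected dimensions.
    ordered = sorted(data, key=lambda row: math.hypot(row[d_min], row[d_max]))

    # Cut the sorted points into k contiguous partitions and use each
    # partition's mean as an initial centroid.
    n = len(ordered)
    seeds = []
    for i in range(k):
        chunk = ordered[i * n // k:(i + 1) * n // k]
        seeds.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return seeds


if __name__ == "__main__":
    points = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [11.0, 11.0]]
    print(kurtosis_sorted_seeds(points, 2))
```

The resulting seeds could then be handed to any K-means implementation that accepts explicit initial centroids. Because the seeding is deterministic, repeated runs start from the same centroids, which is the kind of stability the abstract attributes to the proposed approach.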
Pages: 207-244
Page count: 38
Related papers
45 in total
  • [21] A k-means clustering-based security framework for mobile data mining
    Guizani, Sghaier
    WIRELESS COMMUNICATIONS & MOBILE COMPUTING, 2016, 16 (18) : 3449 - 3454
  • [22] PSO-based K-Means clustering with enhanced cluster matching for gene expression data
    Lam, Yau-King
    Tsang, P. W. M.
    Leung, Chi-Sing
    NEURAL COMPUTING & APPLICATIONS, 2013, 22 (7-8) : 1349 - 1355
  • [23] PSO-based K-Means clustering with enhanced cluster matching for gene expression data
    Yau-King Lam
    P. W. M. Tsang
    Chi-Sing Leung
    Neural Computing and Applications, 2013, 22 : 1349 - 1355
  • [24] Cancer tissue detection using improved K-means initialization method for multi-dimensional microarray big data clustering
    Pandey K.K.
    Shukla D.
    Journal of Ambient Intelligence and Humanized Computing, 2023, 14 (07) : 9277 - 9303
  • [25] Initial points selection for clustering gene expression data: A spatial contiguity analysis-based approach
    Yi, Hui
    Bo, Cuimei
    Song, Xiaofeng
    Yuan, Yuhao
    BIO-MEDICAL MATERIALS AND ENGINEERING, 2014, 24 (06) : 3709 - 3717
  • [26] Data Mining & Pattern Recognition of Voltage Sag Based on K-means Clustering Algorithm
    Duan, R. C.
    Wang, F. H.
    Zhang, J.
    Huang, R. H.
    Zhang, X.
    2015 IEEE POWER & ENERGY SOCIETY GENERAL MEETING, 2015
  • [27] DIC-DOC-K-means: Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means for improving the effectiveness of text document clustering
    Lakshmi, R.
    Baskar, S.
    JOURNAL OF INFORMATION SCIENCE, 2019, 45 (06) : 818 - 832
  • [28] Visualization of Medical Volume Data Based on Improved K-Means Clustering and Segmentation Rules
    Ma, Ji
    Muad, Yazan Ahmad
    Chen, Jinjin
    IEEE ACCESS, 2021, 9 : 100498 - 100512
  • [29] Clustering transformed compositional data using K-means, with applications in gene expression and bicycle sharing system data
    Godichon-Baggioni, Antoine
    Maugis-Rabusseau, Cathy
    Rau, Andrea
    JOURNAL OF APPLIED STATISTICS, 2019, 46 (01) : 47 - 65
  • [30] Improving Clustering Efficiency by SimHash-based K-Means Algorithm for Big Data Analytics
    Wang, Jenq-Haur
    Lin, Jia-Zhi
    2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2016 : 1881 - 1888