Min max kurtosis distance based improved initial centroid selection approach of K-means clustering for big data mining on gene expression data

Cited by: 3

Authors:

Pandey, Kamlesh Kumar ^{[1]}

Shukla, Diwakar ^{[1,2]}

Affiliations:

[1] Dr Hari Singh Gour Vishwavidyalaya, Dept Comp Sci & Applicat, Sagar, MP, India

[2] Dr Hari Singh Gour Vishwavidyalaya, Dept Math & Stat, Sagar, India

Source:

EVOLVING SYSTEMS | 2023, Vol. 14, Issue 02

Keywords:

Big data clustering; Initial centroid algorithm; Genome clustering; Gene expression data clustering; Kurtosis clustering; Systematic sampling; Sorting heuristic; Convergence speed; K-means; ALGORITHM; ROBUST;

DOI:

10.1007/s12530-022-09447-z

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Genome clustering is a big data application that helps identify the prognosis of serious diseases and biological processes across enormous sets of genes. K-Means (KM) is the most commonly used clustering algorithm for gene expression data; it extracts hidden knowledge, patterns and trends from gene expression profiles to support decision-making. Unfortunately, the KM algorithm is extremely sensitive to initial centroid selection, because the initial cluster centroids influence computational effectiveness, efficiency, cost and susceptibility to local optima. Existing centroid initialization algorithms incur high computational complexity on high-dimensional data due to extensive iterations, distance computations, and comparisons of data and results. To overcome these weaknesses, this study proposes the Min-Max Kurtosis Distance (MMKD) algorithm for big data clustering in a single-machine environment. The MMKD algorithm addresses the KM clustering weaknesses by using the distance between data points of origin and the minimum-maximum kurtosis dimension. The proposed algorithm is compared with the KM, KM++, ADV, MKM, Mean-KM, NFD, K-MAM, NRKM2, FMNN and MuKM algorithms using internal and external effectiveness validation metrics, together with efficiency measurements, on sixteen gene expression datasets. The experimental evaluation shows that the MMKDKM algorithm reduces iterations, local optima and computation costs, and improves cluster performance, effectiveness and efficiency, with more stable convergence than the other algorithms. The statistical analysis in this study indicates that the proposed MMKDKM algorithm achieves a statistically significant difference.
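The abstract does not give the MMKD procedure in detail. The sketch below is one possible reading of the idea, pieced together from the abstract and the keywords (kurtosis, sorting heuristic, systematic sampling); it should not be taken as the authors' published algorithm. The function names `excess_kurtosis` and `kurtosis_guided_centroids`, the use of population excess kurtosis, the Euclidean distance from the origin restricted to the two selected dimensions, and the mid-of-group sampling rule are all assumptions made for illustration.

```python
import math


def excess_kurtosis(values):
    """Population excess kurtosis m4 / m2**2 - 3 (0.0 for a constant column)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0 if m2 > 0 else 0.0


def kurtosis_guided_centroids(points, k):
    """Pick k initial centroids; a hypothetical helper, not the published MMKD."""
    dims = range(len(points[0]))
    kurt = [excess_kurtosis([p[d] for p in points]) for d in dims]
    d_min = min(dims, key=lambda d: kurt[d])  # dimension with minimum kurtosis
    d_max = max(dims, key=lambda d: kurt[d])  # dimension with maximum kurtosis

    # Sorting heuristic (assumed): order points by their distance from the
    # origin, measured only in the two selected dimensions.
    ordered = sorted(points, key=lambda p: math.hypot(p[d_min], p[d_max]))

    # Systematic sampling (assumed): split the sorted list into k equal-width
    # groups and take the point in the middle of each group.
    step = len(ordered) / k
    return [ordered[int((i + 0.5) * step)] for i in range(k)]


points = [(0, 0), (1, 1), (2, 2), (10, 10), (11, 11), (12, 12)]
seeds = kurtosis_guided_centroids(points, 2)
```

On this toy input the two seeds are `(1, 1)` and `(11, 11)`, one from each visible group. The seeds would then be passed to an ordinary K-means loop in place of random initialization; because the selection is deterministic, repeated runs start from the same centroids.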


Pages: 207-244

Page count: 38
