Min-max kurtosis stratum mean: An improved K-means cluster initialization approach for microarray gene clustering on multidimensional big data

Cited by: 6
Authors
Pandey, Kamlesh Kumar [1]
Shukla, Diwakar [1]
Affiliations
[1] Dr Hari Singh Gour Vishwavidyalaya, Dept Comp Sci & Applicat, Sagar, Madhya Pradesh, India
Keywords
big data clustering; gene clustering; initial centroid; K-means; microarray clustering; multidimensional clustering; EXPRESSION DATA; MEANS ALGORITHM; EVOLUTION; SELECTION; SIZE
DOI
10.1002/cpe.7185
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
Microarray gene clustering is a big data application that uses the K-means (KM) clustering algorithm to identify hidden patterns, evolutionary relationships, unknown functions and gene trends for disease diagnosis, tissue detection and biological analysis. The selection of initial centroids is a major issue in the KM algorithm because it influences the effectiveness and efficiency of the clustering and the local optimum to which the algorithm converges. Existing centroid initialization algorithms are computationally expensive and degrade cluster quality because of the high dimensionality and interconnectedness of microarray gene data. To address this issue, this study proposes the min-max kurtosis stratum mean (MKSM) algorithm for big data clustering in a single-machine environment. The MKSM algorithm uses kurtosis for dimension selection, mean distance for gene relationship identification, and stratification for heterogeneous centroid extraction. The proposed algorithm is compared with the state-of-the-art initialization strategy on twelve microarray gene datasets using internal, external and statistical assessment criteria. The experimental results demonstrate that the MKSMKM algorithm (KM with MKSM initialization) reduces the number of iterations, distance computations and data comparisons, is less prone to poor local optima, and improves cluster performance, effectiveness and efficiency with stable convergence.
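The abstract names three ingredients (kurtosis, a mean-based ordering, and stratification) but does not give the procedure itself. The following is only a minimal sketch of how such ingredients might be combined, not the published MKSM algorithm. The function names `kurtosis` and `stratum_mean_centroids`, the choice of the minimum- and maximum-kurtosis dimensions, the ordering key, and the equal-size strata are all assumptions made for illustration.

```python
from statistics import fmean


def kurtosis(values):
    """Population excess kurtosis of a sequence; 0.0 for a constant column."""
    m = fmean(values)
    var = fmean((v - m) ** 2 for v in values)
    if var == 0:
        return 0.0
    return fmean((v - m) ** 4 for v in values) / var ** 2 - 3.0


def stratum_mean_centroids(rows, k):
    """Hypothetical kurtosis-guided, stratified initialization (an assumption,
    not the procedure from the paper). Returns k centroids as lists of floats."""
    if not 1 <= k <= len(rows):
        raise ValueError("k must be between 1 and the number of rows")
    n_dims = len(rows[0])
    columns = [[r[d] for r in rows] for d in range(n_dims)]
    kurt = [kurtosis(c) for c in columns]
    # Assumption: keep only the dimensions with the smallest and largest kurtosis.
    lo = min(range(n_dims), key=kurt.__getitem__)
    hi = max(range(n_dims), key=kurt.__getitem__)
    # Assumption: order rows by their mean value over the two selected dimensions.
    ordered = sorted(rows, key=lambda r: (r[lo] + r[hi]) / 2)
    # Split the ordered rows into k contiguous strata of near-equal size.
    n = len(ordered)
    bounds = [i * n // k for i in range(k + 1)]
    centroids = []
    for start, end in zip(bounds, bounds[1:]):
        stratum = ordered[start:end]
        # Each initial centroid is the per-dimension mean of one stratum.
        centroids.append([fmean(r[d] for r in stratum) for d in range(n_dims)])
    return centroids


if __name__ == "__main__":
    data = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0],
            [10.0, 10.0], [11.0, 11.0], [12.0, 12.0]]
    print(stratum_mean_centroids(data, 2))
```

Because the centroids are derived deterministically from the data, repeated runs give the same starting point, which is one plausible reason a scheme of this kind could converge more stably than random seeding. The resulting centroids would then seed an ordinary K-means run.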
Pages: 33