Combining K-MEANS and a genetic algorithm through a novel arrangement of genetic operators for high quality clustering

Cited by: 52

Authors:

Islam, Md Zahidul [1]

Estivill-Castro, Vladimir [2]

Rahman, Md Anisur [1]

Bossomaier, Terry [1]

Affiliations:

[1] Charles Sturt Univ, Sch Comp & Math, Panorama Ave, Bathurst, NSW 2795, Australia

[2] Griffith Univ, Sch Informat & Computat Technol, Kessels Rd, Nathan, Qld 4111, Australia

Source:

EXPERT SYSTEMS WITH APPLICATIONS | 2018 / Vol. 91

Keywords:

Clustering; Genetic algorithm; K-MEANS; Data mining; Cluster evaluation; EXPRESSION DATA;

DOI:

10.1016/j.eswa.2017.09.005

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Knowledge discovery from data can be broadly categorized into two types: supervised and unsupervised. A supervised knowledge discovery process such as classification by decision trees typically requires class labels, which are sometimes unavailable in datasets. Unsupervised knowledge discovery techniques such as an unsupervised clustering technique can handle datasets without class labels. They aim to let data reveal the groups (i.e., the data elements in each group) and the number of groups. For the ubiquitous task of clustering, K-MEANS is the most used algorithm, applied in a broad range of areas to identify groups where intra-group distances are much smaller than inter-group distances. As a representative-based clustering approach, K-MEANS offers an extremely efficient gradient descent approach to the total squared error of representation; however, it not only demands the parameter k, but it also makes assumptions about the similarity of density among the clusters. Therefore, it is profoundly affected by noise. Perhaps more seriously, it can often be attracted to local optima despite its immersion in a multi-start scheme. We present an effective genetic algorithm that combines the capacity of genetic operators to conglomerate different solutions of the search space with the exploitation of the hill-climber. We advance a previous genetic-searching approach called GENCLUST, with the intervention of fast hill-climbing cycles of K-MEANS, and obtain an algorithm that is faster than its predecessor and achieves clustering results of higher quality. We demonstrate this across a series of 18 commonly researched datasets. (C) 2017 Elsevier Ltd. All rights reserved.
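The abstract describes K-MEANS as a hill-climber on the total squared error that is usually wrapped in a multi-start scheme. The following is only a minimal sketch of that baseline idea, not the authors' GENCLUST-based algorithm: it runs plain Lloyd-style K-MEANS from several random seedings and keeps the run with the lowest total squared error. All function names and parameters (`kmeans`, `multi_start_kmeans`, `restarts`, `seed`) are illustrative assumptions and do not come from the paper.

```python
import random


def assign(points, centers):
    """Label each point with the index of its nearest center (squared Euclidean)."""
    return [
        min(range(len(centers)),
            key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centers[j])))
        for p in points
    ]


def sse(points, centers, labels):
    """Total squared error: squared distance of every point to its assigned center."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(p, centers[l]))
        for p, l in zip(points, labels)
    )


def kmeans(points, k, rng, max_iter=100):
    """One K-MEANS run from k randomly chosen data points used as initial centers."""
    centers = rng.sample(points, k)
    labels = assign(points, centers)
    for _ in range(max_iter):
        # Update step: move each center to the mean of its current members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
        new_labels = assign(points, centers)
        if new_labels == labels:  # no change: a (possibly local) optimum
            break
        labels = new_labels
    return centers, labels, sse(points, centers, labels)


def multi_start_kmeans(points, k, restarts=10, seed=0):
    """Assumed multi-start wrapper: keep the run with the lowest total squared error."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[2])
```

Each restart can only descend to the optimum nearest its seeding, which is the limitation the abstract points to; the paper's contribution is to let genetic operators recombine such solutions instead of relying on independent restarts alone.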

Citation

Pages: 402-417

Page count: 16

