A Genetic Algorithm Approach for Clustering Large Data Sets

Cited by: 0

Authors:

Luchi, Diego [1]

Rodrigues, Alexandre [1]

Varejao, Flavio Miguel [1]

Santos, Willian [1]

Affiliation:

[1] Fed Univ State Espirito Santo, Vitoria, ES, Brazil

Source:

2016 IEEE 28TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2016) | 2016

Keywords:

DOI:

10.1109/ICTAI.2016.90

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this paper we present a sampling approach for running the k-means algorithm on large data sets. We propose a genetic algorithm that guides the sampling by evaluating the fitness of each individual in the population with the k-means clustering algorithm. Although we want a partition with the lowest SSE, our algorithm tries to find the sample with the highest SSE. After a good sample is found, the remaining points of the data set are assigned to the nearest centroid, and the SSE of the final solution is then calculated. Our proposal is applied to a set of public-domain data sets, and the results are compared against two other methods: k-means run on a uniform random sample of the data set, and k-means run on the complete data set. The results show that our algorithm offers a good trade-off between quality and computational cost, especially for large data sets and higher numbers of clusters.
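The pipeline described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the encoding of individuals, the selection scheme, or the crossover and mutation operators, so the index-list encoding, truncation selection, union-based crossover, single-index mutation, and all function and parameter names below are hypothetical choices. The only parts taken from the abstract are that fitness is the SSE of k-means on the sample (higher is treated as better), that the remaining points are assigned to the nearest centroid, and that the final SSE is computed on the whole data set.

```python
import random

def dist2(p, q):
    # Squared Euclidean distance between two points.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centroids):
    # Index of the centroid closest to p.
    return min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))

def sse(points, centroids):
    # Sum of squared errors of points to their nearest centroid.
    return sum(dist2(p, centroids[nearest(p, centroids)]) for p in points)

def kmeans(points, k, rng, iters=10):
    # Plain Lloyd iterations with random initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

def ga_sample_kmeans(data, k, sample_size, pop_size=8, generations=5, seed=0):
    rng = random.Random(seed)
    n = len(data)
    # Assumed encoding: an individual is a list of distinct row indices.
    population = [rng.sample(range(n), sample_size) for _ in range(pop_size)]
    best_fit, best_cents = -1.0, None
    for _ in range(generations):
        scored = []
        for ind in population:
            sample = [data[i] for i in ind]
            cents = kmeans(sample, k, rng)
            fit = sse(sample, cents)  # fitness: SSE on the sample, higher is better
            scored.append((fit, ind))
            if fit > best_fit:
                best_fit, best_cents = fit, cents
        # Assumed selection: keep the top half of the population.
        scored.sort(key=lambda t: t[0], reverse=True)
        parents = [ind for _, ind in scored[: pop_size // 2]]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Assumed crossover: resample from the union of both parents.
            child = rng.sample(list(set(a) | set(b)), sample_size)
            # Assumed mutation: swap in one random index if it is not already present.
            new = rng.randrange(n)
            if new not in child:
                child[rng.randrange(sample_size)] = new
            children.append(child)
        population = parents + children
    # Nearest-centroid assignment of every point, then SSE on the full data set.
    return best_cents, sse(data, best_cents)
```

A call such as `ga_sample_kmeans(data, k=3, sample_size=30)`, with `data` given as a list of numeric tuples, returns the centroids of the best sample found and the SSE of the full data set under nearest-centroid assignment. Because k-means is re-run with random initial centroids at each evaluation, the fitness values in this sketch are noisy.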

Pages: 570 - 576

Page count: 7

50 records in total

[31] RAPID INITIAL CLUSTERING OF LARGE DATA SETS
GAUCH, HG
VEGETATIO, 1980, 42 (1-3) : 103 - 111
[32] Extensions to the k-means algorithm for clustering large data sets with categorical values
Huang, ZX
DATA MINING AND KNOWLEDGE DISCOVERY, 1998, 2 (03) : 283 - 304
[33] A fast hierarchical clustering algorithm for large-scale protein sequence data sets
Szilagyi, Sandor M.
Szilagyi, Laszlo
COMPUTERS IN BIOLOGY AND MEDICINE, 2014, 48 : 94 - 101
[34] Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
Zhexue Huang
Data Mining and Knowledge Discovery, 1998, 2 : 283 - 304
[35] Scalable parallel clustering approach for large data using genetic possibilistic fuzzy c-means algorithm
Mathew, Juby
Vijayakumar, R.
2014 IEEE INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND COMPUTING RESEARCH (IEEE ICCIC), 2014 : 226 - 232
[36] A new spectral clustering algorithm for large training sets
Prieto, R
Jiang, J
Choi, CH
2003 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-5, PROCEEDINGS, 2003 : 147 - 152
[37] Grouping Genetic Algorithm for Data Clustering
Peddi, Santhosh
Singh, Alok
SWARM, EVOLUTIONARY, AND MEMETIC COMPUTING, PT I, 2011, 7076 : 225 - 232
[38] GANY: A Genetic Spectral-based Clustering Algorithm for Large Data Analysis
Menendez, Hector D.
Camacho, David
2015 IEEE CONGRESS ON EVOLUTIONARY COMPUTATION (CEC), 2015 : 640 - 647
[39] Clustering for Binary Data Sets by Using Genetic Algorithm-Incremental K-means
Saharan, S.
Baragona, R.
Nor, M. E.
Salleh, R. M.
Asrah, N. M.
INTERNATIONAL SEMINAR ON MATHEMATICS AND PHYSICS IN SCIENCES AND TECHNOLOGY 2017 (ISMAP 2017), 2018, 995
[40] Privacy-preserving constrained spectral clustering algorithm for large-scale data sets
Li, Ji
Wei, Jianghong
Ye, Mao
Liu, Wenfen
Hu, Xuexian
IET INFORMATION SECURITY, 2020, 14 (03) : 321 - 331
