A Genetic Algorithm Approach for Clustering Large Data Sets

Cited by: 0
Authors
Luchi, Diego [1]
Rodrigues, Alexandre [1]
Varejao, Flavio Miguel [1]
Santos, Willian [1]
Affiliations
[1] Fed Univ State Espirito Santo, Vitoria, ES, Brazil
DOI
10.1109/ICTAI.2016.90
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper we present a sampling approach for running the k-means algorithm on large data sets. We propose a genetic algorithm that guides the sampling by evaluating the fitness of each individual in the population with the k-means clustering algorithm. Although we want a partition with the lowest sum of squared errors (SSE), our algorithm searches for the sample with the highest SSE. Once a good sample is found, the remaining points of the data set are assigned to their nearest centroid, and the SSE of the final solution is then calculated. We apply our proposal to a set of public-domain data sets and compare the results against two other methods: k-means run on a uniform random sample of the data set, and k-means run on the complete data set. The results show that our algorithm offers a good trade-off between quality and computational cost, especially for large data sets and higher numbers of clusters.
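The pipeline outlined in the abstract can be illustrated with a short sketch. The abstract does not specify the chromosome encoding, the genetic operators, or any parameter values, so everything below is an assumption made for illustration: an individual is taken to be a list of distinct row indices (one candidate sample), selection keeps the half of the population with the highest sample SSE, crossover draws a child from the union of two parents, and mutation swaps in random unused indices. All function names (`genetic_sampling_kmeans`, `fitness`, `crossover`, `mutate`) are hypothetical, and the k-means used here is a plain Lloyd iteration, not necessarily the variant used by the authors.

```python
import random

def sq_dist(p, q):
    # Squared Euclidean distance between two points given as tuples.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centroids):
    # Index of the centroid closest to p.
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))

def sse(points, centroids):
    # Sum of squared distances from each point to its nearest centroid.
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def kmeans(points, k, rng, iters=20):
    # Plain Lloyd iterations; an empty cluster keeps its previous centroid.
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    return centroids

def fitness(sample_idx, data, k, rng):
    # Fitness of an individual: SSE of k-means run on the sample only.
    pts = [data[i] for i in sample_idx]
    cents = kmeans(pts, k, rng)
    return sse(pts, cents), cents

def crossover(a, b, size, rng):
    # Assumed operator: draw the child from the union of both parents.
    return rng.sample(list(set(a) | set(b)), size)

def mutate(ind, n, rng, rate=0.1):
    # Assumed operator: replace some indices with unused random ones.
    chosen, out = set(ind), list(ind)
    for pos in range(len(out)):
        if rng.random() < rate:
            new = rng.randrange(n)
            if new not in chosen:
                chosen.discard(out[pos])
                chosen.add(new)
                out[pos] = new
    return out

def genetic_sampling_kmeans(data, k, sample_size, pop_size=8,
                            generations=10, seed=0):
    rng = random.Random(seed)
    n = len(data)
    population = [rng.sample(range(n), sample_size) for _ in range(pop_size)]
    best = None  # (sample SSE, centroids) of the best individual seen
    for _ in range(generations):
        scored = []
        for ind in population:
            score, cents = fitness(ind, data, k, rng)
            scored.append((score, cents, ind))
        # The search maximizes the sample SSE, as stated in the abstract.
        scored.sort(key=lambda t: t[0], reverse=True)
        if best is None or scored[0][0] > best[0]:
            best = scored[0][:2]
        parents = [ind for _, _, ind in scored[: pop_size // 2]]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, sample_size, rng), n, rng))
        population = parents + children
    sample_sse, centroids = best
    # Final step: assign every point to its nearest centroid and
    # compute the SSE over the entire data set.
    labels = [nearest(p, centroids) for p in data]
    return centroids, labels, sse(data, centroids), sample_sse
```

A call such as `genetic_sampling_kmeans(data, k=3, sample_size=30)`, where `data` is a list of numeric tuples, returns the centroids, the cluster label of every point, the SSE over the full data set, and the SSE of the selected sample. In this sketch k-means only ever runs on `sample_size` points, which is where the cost saving over clustering the complete data set would come from.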
Pages: 570-576
Number of pages: 7