PAMAE: Parallel k-Medoids Clustering with High Accuracy and Efficiency

被引:29
作者
Song, Hwanjun [1 ]
Lee, Jae-Gil [1 ]
Han, Wook-Shin [2 ]
机构
[1] Korea Adv Inst Sci & Technol, Daejeon, South Korea
[2] POSTECH, Pohang, South Korea
来源
KDD'17: PROCEEDINGS OF THE 23RD ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING | 2017年
基金
新加坡国家研究基金会;
关键词
k-medoids; PAM; parallelization; sampling; Hadoop; Spark;
D O I
10.1145/3097983.3098098
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
The k-medoids algorithm is one of the best-known clustering algorithms. Despite this, however, it is not as widely used for big data analytics as the k-means algorithm, mainly because of its high computational complexity. Many studies have attempted to solve the efficiency problem of the k-medoids algorithm, but all such studies have improved efficiency at the expense of accuracy. In this paper, we propose a novel parallel k-medoids algorithm, which we call PAMAE, that achieves both high accuracy and high efficiency. We identify two factors-"global search" and "entire data"-that are essential to achieving high accuracy, but are also very time-consuming if considered simultaneously. Thus, our key idea is to apply them individually through two phases: parallel seeding and parallel refinement, neither of which is costly. The first phase performs global search over sampled data, and the second phase performs local search over entire data. Our theoretical analysis proves that this serial execution of the two phases leads to an accurate solution that would be achieved by global search over entire data. In order to validate the merit of our approach, we implement PAMAE on Spark as well as Hadoop and conduct extensive experiments using various real-world data sets oil 12 Microsoft Azure machines (48 cores). The results show that PAMAE significantly outperforms most of recent parallel algorithms and, at the same time, produces a clustering quality as comparable as the previous most-accurate algorithm. The source code and data are available at https://github.com/jaegil/k-Medoid.
引用
收藏
页码:1087 / 1096
页数:10
相关论文
共 30 条
[1]  
[Anonymous], 1987, CLUSTERING MEANS MED
[2]  
[Anonymous], 2015, Data mining: the textbook
[3]  
[Anonymous], J EXP ALGORITHM
[4]  
[Anonymous], 2011, J INFORM DATA MANAGE
[5]  
[Anonymous], 2013, NIPS 13, DOI DOI 10.1007/S10898-019-00840-8
[6]  
[Anonymous], 2011, J STAT ED
[7]   MODEL-BASED GAUSSIAN AND NON-GAUSSIAN CLUSTERING [J].
BANFIELD, JD ;
RAFTERY, AE .
BIOMETRICS, 1993, 49 (03) :803-821
[8]   Weiszfeld's Method: Old and New Results [J].
Beck, Amir ;
Sabach, Shoham .
JOURNAL OF OPTIMIZATION THEORY AND APPLICATIONS, 2015, 164 (01) :1-40
[9]   Assessing a mixture model for clustering with the integrated completed likelihood [J].
Biernacki, C ;
Celeux, G ;
Govaert, G .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2000, 22 (07) :719-725
[10]   MULTIVARIATE LOCATION ESTIMATION USING EXTENSION OF R-ESTIMATES THROUGH U-STATISTICS TYPE APPROACH [J].
CHAUDHURI, P .
ANNALS OF STATISTICS, 1992, 20 (02) :897-916