PAMAE: Parallel k-Medoids Clustering with High Accuracy and Efficiency

被引：29

作者：

Song, Hwanjun ^{[1
]}

Lee, Jae-Gil ^{[1
]}

Han, Wook-Shin ^{[2
]}

机构：

[1] Korea Adv Inst Sci & Technol, Daejeon, South Korea

[2] POSTECH, Pohang, South Korea

来源：

KDD'17: PROCEEDINGS OF THE 23RD ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING | 2017年

基金：

新加坡国家研究基金会;

关键词：

k-medoids; PAM; parallelization; sampling; Hadoop; Spark;

D O I：

10.1145/3097983.3098098

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

The k-medoids algorithm is one of the best-known clustering algorithms. Despite this, however, it is not as widely used for big data analytics as the k-means algorithm, mainly because of its high computational complexity. Many studies have attempted to solve the efficiency problem of the k-medoids algorithm, but all such studies have improved efficiency at the expense of accuracy. In this paper, we propose a novel parallel k-medoids algorithm, which we call PAMAE, that achieves both high accuracy and high efficiency. We identify two factors-"global search" and "entire data"-that are essential to achieving high accuracy, but are also very time-consuming if considered simultaneously. Thus, our key idea is to apply them individually through two phases: parallel seeding and parallel refinement, neither of which is costly. The first phase performs global search over sampled data, and the second phase performs local search over entire data. Our theoretical analysis proves that this serial execution of the two phases leads to an accurate solution that would be achieved by global search over entire data. In order to validate the merit of our approach, we implement PAMAE on Spark as well as Hadoop and conduct extensive experiments using various real-world data sets oil 12 Microsoft Azure machines (48 cores). The results show that PAMAE significantly outperforms most of recent parallel algorithms and, at the same time, produces a clustering quality as comparable as the previous most-accurate algorithm. The source code and data are available at https://github.com/jaegil/k-Medoid.

引用

页码：1087 / 1096

页数：10

共 30 条

[1]

[Anonymous], 1987, CLUSTERING MEANS MED

[2]

[Anonymous], 2015, Data mining: the textbook

[3]

[Anonymous], J EXP ALGORITHM

[4]

[Anonymous], 2011, J INFORM DATA MANAGE