Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms

被引：209

作者：

Schubert, Erich ^{[1
]}

Rousseeuw, Peter J. ^{[2
]}

机构：

[1] Tech Univ Dortmund, Dortmund, Germany

[2] Katholieke Univ Leuven, Dept Math, Leuven, Belgium

来源：

SIMILARITY SEARCH AND APPLICATIONS (SISAP 2019) | 2019年 / 11807卷

关键词：

Cluster analysis; k-Medoids; PAM; CLARA; CLARANS;

D O I：

10.1007/978-3-030-32047-8_16

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

Clustering non-Euclidean data is difficult, and one of the most used algorithms besides hierarchical clustering is the popular algorithm Partitioning Around Medoids (PAM), also simply referred to as k-medoids. In Euclidean geometry the mean-as used in k-means-is a good estimator for the cluster center, but this does not exist for arbitrary dissimilarities. PAM uses the medoid instead, the object with the smallest dissimilarity to all others in the cluster. This notion of centrality can be used with any (dis-)similarity, and thus is of high relevance to many domains and applications. A key issue with PAM is its high run time cost. We propose modifications to the PAM algorithm that achieve an O(k)-fold speedup in the second ("SWAP") phase of the algorithm, but will still find the same results as the original PAM algorithm. If we slightly relax the choice of swaps performed (while retaining comparable quality), we can further accelerate the algorithm by performing up to k swaps in each iteration. With the substantially faster SWAP, we can now explore faster intialization strategies. We also show how the CLARA and CLARANS algorithms benefit from the proposed modifications.

引用

页码：171 / 187

页数：17

共 20 条

[1]

Arthur D., 2006, Proceedings of the Twenty-Second Annual Symposium on Computational Geometry (SCG'06), P144, DOI 10.1145/1137856.1137880

[2]

Arthur D, 2007, PROCEEDINGS OF THE EIGHTEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P1027

[3] Mechanisms of alternative pre-messenger RNA splicing [J].

Black, DL .

ANNUAL REVIEW OF BIOCHEMISTRY, 2003, 72 :291-336

[4]

Bock HH., 2007, Selected contributions in data analysis and classification. Studies in classification, data analysis, P161, DOI [DOI 10.1007/978-3-540-73560-1_15, DOI 10.1007/978-3-540-73560-115, 10.1007/9783540735601_15]

[5]

Bradley PS, 1997, ADV NEUR IN, V9, P368

[6]

Dheeru D., 2017, UCI MACHINE LEARNING

[7]

Ester M., 1996, P 2 INT C KNOWL DISC

[8] Robust distance-based clustering with applications to spatial data mining [J].

Estivill-Castro, V ;

Houle, ME .

ALGORITHMICA, 2001, 30 (02) :216-242

[9]

Estivill-Castro V., 2002, ACM SIGKDD EXPLOR NE, V4, P65, DOI [10.1145/568574.568575, DOI 10.1145/568574.568575]

[10]

Kaufman L., 1987, Statistical Data Analysis Based on the L1-Norm and Related Methods. First International Conference, P405

← 1 2 →