Scalable Clustering by Iterative Partitioning and Point Attractor Representation

被引：22

作者：

Shao, Junming ^{[1
]}

Yang, Qinli ^{[1
]}

Dang, Hoang-Vu ^{[2
,4
]}

Schmidt, Bertil ^{[3
]}

Kramer, Stefan ^{[3
]}

机构：

[1] Univ Elect Sci & Technol China, Big Data Res Ctr, West Hitech Zone, 2006 Xiyuan Ave, Chengdu 611731, Peoples R China

[2] Johannes Gutenberg Univ Mainz, Mainz, Germany

[3] Johannes Gutenberg Univ Mainz, Inst Comp Sci, Staudingerweg 9, D-55128 Mainz, Germany

[4] Univ Illinois, Dept Comp Sci, 1304 W Springfield Ave, Urbana, IL 61801 USA

来源：

ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA | 2016年 / 11卷 / 01期

基金：

中国博士后科学基金; 中国国家自然科学基金;

关键词：

High-performance algorithm; scalable clustering; synchronization; SYNCHRONIZATION; ALGORITHMS;

D O I：

10.1145/2934688

中图分类号：

TP [自动化技术、计算机技术];

学科分类号：

0812 ;

摘要：

Clustering very large datasets while preserving cluster quality remains a challenging data-mining task to date. In this paper, we propose an effective scalable clustering algorithm for large datasets that builds upon the concept of synchronization. Inherited from the powerful concept of synchronization, the proposed algorithm, CIPA (Clustering by Iterative Partitioning and Point Attractor Representations), is capable of handling very large datasets by iteratively partitioning them into thousands of subsets and clustering each subset separately. Using dynamic clustering by synchronization, each subset is then represented by a set of point attractors and outliers. Finally, CIPA identifies the cluster structure of the original dataset by clustering the newly generated dataset consisting of points attractors and outliers from all subsets. We demonstrate that our new scalable clustering approach has several attractive benefits: (a) CIPA faithfully captures the cluster structure of the original data by performing clustering on each separate data iteratively instead of using any sampling or statistical summarization technique. (b) It allows clustering very large datasets efficiently with high cluster quality. (c) CIPA is parallelizable and also suitable for distributed data. Extensive experiments demonstrate the effectiveness and efficiency of our approach.

引用

页数：23

共 37 条

[1] The Kuramoto model:: A simple paradigm for synchronization phenomena [J].

Acebrón, JA ;

Bonilla, LL ;

Vicente, CJP ;

Ritort, F ;

Spigler, R .

REVIEWS OF MODERN PHYSICS, 2005, 77 (01) :137-185

[2]

Adinetz A, 2013, LECT NOTES COMPUT SC, V8097, P838, DOI 10.1007/978-3-642-40047-6_83

[3]

Aeyels Filip De Smet Dirk, 2008, PHYSICA D, V273

[4]

Andritsos P, 2004, LECT NOTES COMPUT SC, V2992, P123

[5] Synchronization in complex networks [J].

Arenas, Alex ;

Diaz-Guilera, Albert ;

Kurths, Jurgen ;

Moreno, Yamir ;

Zhou, Changsong .

PHYSICS REPORTS-REVIEW SECTION OF PHYSICS LETTERS, 2008, 469 (03) :93-153

[6] Scalable K-Means++ [J].

Bahmani, Bahman ;

Moseley, Benjamin ;

Vattani, Andrea ;

Kumar, Ravi ;

Vassilvitskii, Sergei .

PROCEEDINGS OF THE VLDB ENDOWMENT, 2012, 5 (07) :622-633

[7] Scalable clustering algorithms with balancing constraints [J].

Banerjee, Arindam ;

Ghosh, Joydeep .

DATA MINING AND KNOWLEDGE DISCOVERY, 2006, 13 (03) :365-395

[8]

Bohm C, 2009, P 18 ACM C INF KNOWL, P661, DOI DOI 10.1145/1645953.1646038

[9]

Bohm<spacing C., 2010, P 16 ACM SIGKDD, P583

[10]

Bradley P. S., 1998, Proceedings Fourth International Conference on Knowledge Discovery and Data Mining, P9

← 1 2 3 4 →