Contraction Clustering (RASTER) A Big Data Algorithm for Density-Based Clustering in Constant Memory and Linear Time

被引:0
作者
Ulm, Gregor [1 ]
Gustavsson, Emil [1 ]
Jirstrand, Mats [1 ]
机构
[1] Fraunhofer Chalmers Res Ctr Ind Math, Chalmers Sci Pk, S-41288 Gothenburg, Sweden
来源
MACHINE LEARNING, OPTIMIZATION, AND BIG DATA, MOD 2017 | 2018年 / 10710卷
关键词
Algorithms; Big data; Machine learning Unsupervised learning; Clustering; FUZZY;
D O I
10.1007/978-3-319-72926-8_6
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Clustering is an essential data mining tool for analyzing and grouping similar objects. In big data applications, however, many clustering methods are infeasible due to their memory requirements or runtime complexity. Contraction Clustering (Raster) is a linear-time algorithm for identifying density-based clusters. Its coefficient is negligible as it depends neither on input size nor the number of clusters. Its memory requirements are constant. Consequently, Raster is suitable for big data applications where the size of the data may be huge. It consists of two steps: (1) a contraction step which projects objects onto tiles and (2) an agglomeration step which groups tiles into clusters. Our algorithm is extremely fast. In single-threaded execution on a contemporary workstation, it clusters ten million points in less than 20 s-when using a slow interpreted programming language like Python. Furthermore, Raster is easily parallelizable.
引用
收藏
页码:63 / 75
页数:13
相关论文
共 22 条
[1]  
Agrawal R., 1998, SIGMOD Record, V27, P94, DOI 10.1145/276305.276314
[2]  
[Anonymous], 2016, Adv Neural Inf Process Syst
[3]   An open source software for fast grid-based data-mining in spatial epidemiology (FGBASE) [J].
Baker, David M. ;
Valleron, Alain-Jacques .
INTERNATIONAL JOURNAL OF HEALTH GEOGRAPHICS, 2014, 13
[4]   An efficient approximation to the K-means clustering for massive data [J].
Capo, Marco ;
Perez, Aritz ;
Lozano, Jose A. .
KNOWLEDGE-BASED SYSTEMS, 2017, 117 :56-69
[5]   GMDBSCAN: Multi-Density DBSCAN Cluster Based on Grid [J].
Chen Xiaoyun ;
Min Yufang ;
Zhao Yan ;
Wang Ping .
PROCEEDINGS OF THE ICEBE 2008: IEEE INTERNATIONAL CONFERENCE ON E-BUSINESS ENGINEERING, 2008, :780-783
[6]   BLOB DETECTION BY RELAXATION [J].
DANKER, AJ ;
ROSENFELD, A .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 1981, 3 (01) :79-92
[7]  
Ester M., 1996, P 2 INT C KNOWL DISC, V96, P226
[8]   A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis [J].
Fahad, Adil ;
Alshatri, Najlaa ;
Tari, Zahir ;
Alamri, Abdullah ;
Khalil, Ibrahim ;
Zomaya, Albert Y. ;
Foufou, Sebti ;
Bouras, Abdelaziz .
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING, 2014, 2 (03) :267-279
[9]   Extending fuzzy and probabilistic clustering to very large data sets [J].
Hathaway, Richard J. ;
Bezdek, James C. .
COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2006, 51 (01) :215-234
[10]   Grid-based DBSCAN Algorithm with Referential Parameters [J].
Huang Darong ;
Wang Peng .
INTERNATIONAL CONFERENCE ON APPLIED PHYSICS AND INDUSTRIAL ENGINEERING 2012, PT B, 2012, 24 :1166-1170