Fast and effective Big Data exploration by clustering

Cited by: 33
Authors
Ianni, Michele [1]
Masciari, Elio [2]
Mazzeo, Giuseppe M. [3]
Mezzanzanica, Mario [4]
Zaniolo, Carlo [5]
Affiliations
[1] Univ Calabria, DIMES, Arcavacata Di Rende, Italy
[2] Univ Naples Federico II, DIETI, Naples, Italy
[3] Facebook, Menlo Pk, CA USA
[4] Milano Bicocca Univ, DISMEQ, Milan, Italy
[5] Univ Calif Los Angeles, Comp Sci, Los Angeles, CA USA
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2020 / Volume 102
Keywords
Big Data; Clustering; Data exploration; ALGORITHM; DBSCAN;
DOI
10.1016/j.future.2019.07.077
Chinese Library Classification number
TP301 [Theory, methods];
Discipline classification code
081202;
Abstract
The rise of the Big Data era calls for more efficient and effective data exploration and analysis tools. In this respect, the need to support advanced analytics on Big Data is driving data scientists' interest toward massively parallel distributed systems and software platforms, such as Map-Reduce and Spark, that make their scalable utilization possible. However, when complex data mining algorithms are required, their fully scalable deployment on such platforms faces a number of technical challenges that grow with the complexity of the algorithms involved. Thus, algorithms that were originally designed for sequential execution must often be redesigned in order to use the distributed computational resources effectively. In this paper, we explore these problems and then propose a solution that has proven to be very effective on the complex hierarchical clustering algorithm CLUBS+. Through four stages of successive refinement, CLUBS+ delivers high-quality clusters of data grouped around their centroids, working in a totally unsupervised fashion. Experimental results confirm the accuracy and scalability of CLUBS+ on platforms tailored for Big Data management. (C) 2019 Elsevier B.V. All rights reserved.
Pages: 84-94
Page count: 11