MapReduce-Based Distributed Clustering Method Using CF+ Tree

Cited by: 3
Authors
Ryu, Hyeong-Cheol [1]
Jung, Sungwon [1]
Affiliations
[1] Sogang Univ, Dept Comp Sci & Engn, Seoul 121742, South Korea
Funding
National Research Foundation, Singapore
Keywords
Clustering; BIRCH; CF tree; range query; very large data sets; MapReduce; Algorithms
DOI
10.1109/ACCESS.2020.2999085
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Clustering exceptionally large data sets is becoming a major challenge in data analytics as their size continues to grow. Summary-based clustering methods and distributed computing frameworks such as MapReduce can handle this challenge efficiently. Summary-based methods include BIRCH and its extension CF+-ERC. CF+-ERC reduces the clustering time of large data sets by exploiting the structure of a CF+ tree. However, CF+-ERC is a sequential clustering method, so it cannot use multiple machines to reduce the clustering time further. In this study, we propose a novel MapReduce-based distributed clustering method called CF+-ERC on MapReduce (CF+ERC_MR). It builds a CF+ tree for clustering an exceptionally large data set with a given threshold and finds the final clusters using MapReduce, which significantly reduces the clustering time. Further, our method is scalable with respect to the number of machines. The efficacy of the method is validated through both theoretical analysis and in-depth experimental analysis on exceptionally large synthetic and real data sets. The experimental results demonstrate that the clustering speed of our approach is far superior to that of existing clustering methods.
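The abstract does not describe the internals of CF+-ERC or CF+ERC_MR, so the following is only a minimal sketch of the standard BIRCH clustering feature (count, linear sum, sum of squares) that summary-based methods build on. It illustrates why such summaries suit a map/reduce split: partial summaries computed per partition can be merged by simple addition. The class name `CF`, the helper `summarize`, and the sample points are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from functools import reduce
import math


@dataclass(frozen=True)
class CF:
    """Standard BIRCH clustering feature: (N, LS, SS). Hypothetical helper class."""
    n: int        # number of points summarized
    ls: tuple     # per-dimension linear sum of the points
    ss: float     # sum of squared norms of the points

    @staticmethod
    def of_point(p):
        return CF(1, tuple(p), sum(x * x for x in p))

    def merge(self, other):
        # CF additivity: summaries of disjoint point sets add component-wise.
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        # Root-mean-square distance of the points from the centroid.
        c2 = sum(x * x for x in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))


def summarize(points):
    """Fold a non-empty collection of points into a single CF."""
    return reduce(CF.merge, (CF.of_point(p) for p in points))


# Assumed toy data: two partitions, as if held by two workers.
partitions = [[(0.0, 0.0), (2.0, 0.0)], [(0.0, 2.0), (2.0, 2.0)]]
partials = [summarize(part) for part in partitions]  # "map" side
total = reduce(CF.merge, partials)                   # "reduce" side
```

Merging the per-partition summaries gives the same CF as summarizing all points in one pass, which is the property that lets the summary step be distributed. How the actual method partitions the data and builds the CF+ tree is not stated in this record.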
Pages: 104232-104246
Number of pages: 15