Distributed stream clustering using micro-clusters on Apache Storm

Cited by: 19
Authors
Karunaratne, Pasan [1]
Karunasekera, Shanika [1]
Harwood, Aaron [1]
Affiliations
[1] Univ Melbourne, Dept Comp & Informat Syst, Parkville, Vic 3010, Australia
Keywords
Stream data mining; Distributed data mining; Stream clustering
DOI
10.1016/j.jpdc.2016.06.004
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The need to extract real-time insights from data has driven demand for machine learning algorithms that can operate on data streams. Given the current extreme rates of data generation (around 5000 messages per second), these algorithms must handle data streams of very high velocity. Many current algorithms do not meet this requirement, in some cases processing only tens of messages per second. In this work we address the limited achievable throughput of stream clustering by developing scalable distributed algorithms, based on the micro-clustering paradigm, that run on cloud platforms. We present two distributed architectures that execute the algorithms in parallel, and we implement these architectures on the Apache Storm stream processing platform. In our experiments, we demonstrate close to an order-of-magnitude improvement in performance. © 2016 Elsevier Inc. All rights reserved.
Pages: 74-84
Page count: 11