Clustering Data Streams Based on Shared Density between Micro-Clusters

Cited by: 101
Authors
Hahsler, Michael [1]
Bolanos, Matthew [2]
Affiliations
[1] So Methodist Univ, Dept Engn Management Informat & Syst, Dallas, TX 75226 USA
[2] Res Now, Plano, TX 75024 USA
Funding
National Science Foundation (USA)
Keywords
Data mining; data stream clustering; density-based clustering
DOI
10.1109/TKDE.2016.2522412
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
As more and more applications produce streaming data, clustering data streams has become an important technique for data and knowledge engineering. A typical approach is to summarize the data stream in real time with an online process into a large number of so-called micro-clusters. Micro-clusters represent local density estimates by aggregating the information of many data points in a defined area. On demand, a (modified) conventional clustering algorithm is used in a second offline step to recluster the micro-clusters into larger final clusters. For reclustering, the centers of the micro-clusters are used as pseudo points with the density estimates used as their weights. However, information about density in the area between micro-clusters is not preserved in the online process, and reclustering is based on possibly inaccurate assumptions about the distribution of data within and between micro-clusters (e.g., uniform or Gaussian). This paper describes DBSTREAM, the first micro-cluster-based online clustering component that explicitly captures the density between micro-clusters via a shared density graph. The density information in this graph is then exploited for reclustering based on actual density between adjacent micro-clusters. We discuss the space and time complexity of maintaining the shared density graph. Experiments on a wide range of synthetic and real data sets highlight that using shared density improves clustering quality over other popular data stream clustering methods, which require the creation of a larger number of smaller micro-clusters to achieve comparable results.
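The two-step idea in the abstract (an online summary into micro-clusters, then offline reclustering over a shared density graph) can be illustrated with a small sketch. This is not the DBSTREAM algorithm from the paper. The class name `MicroClusterSketch`, the parameters `radius` and `min_shared`, and the simplifications (a fixed radius, centers that never move, no forgetting of old data, a raw count threshold on shared points) are all assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations
import math


class MicroClusterSketch:
    """Toy online summary: fixed-radius micro-clusters plus shared-point counts.

    Hypothetical illustration only; not the algorithm described in the paper.
    """

    def __init__(self, radius):
        self.radius = radius             # assumed: one fixed radius for all micro-clusters
        self.centers = []                # center = first point that opened the micro-cluster
        self.weights = []                # number of points absorbed by each micro-cluster
        self.shared = defaultdict(int)   # (i, j) with i < j -> points that fell into both

    def insert(self, point):
        """Online step: absorb one point from the stream."""
        hits = [i for i, c in enumerate(self.centers)
                if math.dist(c, point) <= self.radius]
        if not hits:
            # No micro-cluster covers the point, so open a new one.
            self.centers.append(tuple(point))
            self.weights.append(1)
            return
        for i in hits:
            self.weights[i] += 1
        # A point covered by several micro-clusters is evidence of density
        # in the area between them; record it on the shared density graph.
        for i, j in combinations(hits, 2):
            self.shared[(i, j)] += 1

    def recluster(self, min_shared):
        """Offline step: connected components over sufficiently heavy edges."""
        parent = list(range(len(self.centers)))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        for (i, j), count in self.shared.items():
            if count >= min_shared:
                parent[find(i)] = find(j)

        groups = defaultdict(list)
        for i in range(len(self.centers)):
            groups[find(i)].append(i)
        return sorted(groups.values())
```

Feeding the points (0, 0), (1.5, 0), (0.75, 0), (0.8, 0), (10, 0) with `radius=1.0` opens three micro-clusters; the two middle points fall into both of the first two, so `recluster(min_shared=2)` merges those two and leaves the distant one on its own. Reclustering from center distances alone would have to guess whether the gap between the first two micro-clusters is populated, which is the limitation the abstract points to.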
Pages: 1449-1461
Number of pages: 13