Density-Based Clustering of Data Streams at Multiple Resolutions

Cited by: 101
Authors
Wan, Li [1]
Ng, Wee Keong [1]
Dang, Xuan Hong [2]
Yu, Philip S. [3]
Zhang, Kuan [4]
Affiliations
[1] Nanyang Technol Univ, Singapore 639798, Singapore
[2] Inst Infocomm Res, Singapore, Singapore
[3] Univ Illinois, Chicago, IL USA
[4] Singapore Management Univ, Singapore, Singapore
Keywords
Data mining algorithms; density-based clustering; evolving data streams; algorithm
DOI
10.1145/1552303.1552307
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In data stream clustering, it is desirable to have algorithms that are able to detect clusters of arbitrary shape, clusters that evolve over time, and clusters with noise. Existing stream data clustering algorithms are generally based on an online-offline approach: the online component captures synopsis information from the data stream (thus overcoming real-time and memory constraints) and the offline component generates clusters using the stored synopsis. The online-offline approach affects the overall performance of stream data clustering in various ways: the ease of deriving the synopsis from streaming data; the complexity of the data structure for storing and managing the synopsis; and the frequency at which the offline component is used to generate clusters. In this article, we propose an algorithm that (1) computes and updates synopsis information in constant time; (2) allows users to discover clusters at multiple resolutions; (3) determines the right time for users to generate clusters from the synopsis information; (4) generates clusters of higher purity than existing algorithms; and (5) determines the right threshold function for density-based clustering based on the fading model of stream data. To the best of our knowledge, no existing data stream algorithm has all of these features. Experimental results show that our algorithm is able to detect arbitrarily shaped, evolving clusters with high quality.
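The online-offline split and the fading model mentioned in the abstract can be illustrated with a minimal, generic sketch. This is not the article's algorithm: the class name `FadingGrid`, the grid-cell synopsis, the exponential decay factor, and the fixed density threshold are all assumptions chosen for illustration, and the offline step handles two-dimensional points only.

```python
from collections import deque


class FadingGrid:
    """Hypothetical grid synopsis; not the data structure from the article."""

    def __init__(self, cell_size=1.0, decay=0.5, threshold=2.0):
        self.cell_size = cell_size
        self.decay = decay          # weight multiplier per time unit, in (0, 1)
        self.threshold = threshold  # minimum faded density of a dense cell
        self.cells = {}             # cell index -> (density, last update time)

    def _faded(self, cell, now):
        # Apply the decay lazily, only when the cell is read.
        density, last = self.cells[cell]
        return density * self.decay ** (now - last)

    def insert(self, point, now):
        """Online step: one dictionary lookup and update per arriving point."""
        cell = tuple(int(x // self.cell_size) for x in point)
        old = self._faded(cell, now) if cell in self.cells else 0.0
        self.cells[cell] = (old + 1.0, now)

    def clusters(self, now):
        """Offline step: group dense cells that touch each other (2-D only)."""
        dense = {c for c in self.cells if self._faded(c, now) >= self.threshold}
        seen, result = set(), []
        for start in sorted(dense):
            if start in seen:
                continue
            seen.add(start)
            group, queue = [], deque([start])
            while queue:
                cx, cy = queue.popleft()
                group.append((cx, cy))
                for dx in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        nb = (cx + dx, cy + dy)
                        if nb in dense and nb not in seen:
                            seen.add(nb)
                            queue.append(nb)
            result.append(sorted(group))
        return result


grid = FadingGrid(cell_size=1.0, decay=0.5, threshold=2.0)
for p in [(0.1, 0.2), (0.3, 0.4), (1.1, 0.5), (1.2, 0.6), (5.5, 5.5), (5.6, 5.4)]:
    grid.insert(p, now=0)
print(grid.clusters(now=0))  # adjacent dense cells merge into one cluster
print(grid.clusters(now=1))  # densities have faded below the threshold
```

In this sketch each arriving point costs one expected constant-time dictionary update, and clusters are computed only when `clusters` is called. Because connected dense cells are merged regardless of their overall outline, the resulting clusters can take arbitrary shapes, and cells that stop receiving points fade out of the result over time.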
Pages: 28