Clustering data streams using grid-based synopsis

被引:32
作者
Bhatnagar, Vasudha [1 ]
Kaur, Sharanjit [2 ]
Chakravarthy, Sharma [3 ]
机构
[1] Univ Delhi, Dept Comp Sci, Delhi 110007, India
[2] Univ Delhi, Acharya Narendra Dev Coll, Dept Comp Sci, Delhi 110007, India
[3] Univ Texas Arlington, Dept Comp Sci & Engn, Arlington, TX 76019 USA
关键词
Stream clustering; Synopsis; Micro-cluster; Grid structure; Exclusive clustering; Complete clustering; DENSITY;
D O I
10.1007/s10115-013-0659-1
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Continually advancing technology has made it feasible to capture data online for onward transmission as a steady flow of newly generated data points, termed as data stream. Continuity and unboundedness of data streams make storage of data and multiple scans of data an impractical proposition for the purpose of knowledge discovery. Need to learn structures from data in streaming environment has been a driving force for making clustering a popular technique for knowledge discovery from data streams. Continuous nature of streaming data makes it infeasible to look for point membership among the clusters discovered so far, necessitating employment of a synopsis structure to consolidate incoming data points. This synopsis is exploited for building clustering scheme to meet subsequent user demands. The proposed Exclusive and Complete Clustering (ExCC) algorithm captures non-overlapping clusters in data streams with mixed attributes, such that each point either belongs to some cluster or is an outlier/noise. The algorithm is robust, adaptive to changes in data distribution and detects succinct outliers on-the-fly. It deploys a fixed granularity grid structure as synopsis and performs clustering by coalescing dense regions in grid. Speed-based pruning is applied to synopsis prior to clustering to ensure currency of discovered clusters. Extensive experimentation demonstrates that the algorithm is robust, identifies succinct outliers on-the-fly and is adaptive to change in the data distribution. ExCC algorithm is further evaluated for performance and compared with other contemporary algorithms.
引用
收藏
页码:127 / 152
页数:26
相关论文
共 55 条
[1]  
Abraham A, 2009, STUD COMPUT INTELL, V206, P1, DOI 10.1007/978-3-642-01091-0
[2]  
Aggarwal CC, 2006, SIAM PROC S, P479
[3]  
Aggarwal Charu C, 2007, Data Streams: Models and Algorithms, V31
[4]  
Agrawal R., 1998, P ACM SIGMOD INT C M
[5]  
[Anonymous], SIGKDD EXPLOR NEWSL
[6]  
[Anonymous], 2005, Data Mining: Concepts and Techniques
[7]  
[Anonymous], 2004, P 30 INT C VER LARG
[8]  
[Anonymous], 2003, P 29 INT C VER LARG
[9]  
[Anonymous], 2019, ENTROPY BASED SUBSPA, DOI DOI 10.1145/312129.312199
[10]  
[Anonymous], P ACM S APPL COMP SA