A segment-based framework for modeling and mining data streams

Cited by: 15
Authors
Aggarwal, Charu C. [1]
Affiliation
[1] IBM TJ Watson Res Ctr, Hawthorne, NY 10532 USA
Keywords
Clustering; Stream mining
DOI
10.1007/s10115-010-0366-0
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data streams have become ubiquitous in recent years because advances in hardware technology have enabled the automated recording of large amounts of data. The primary constraint in mining streams effectively is the large volume of data that must be processed in real time. In many cases, it is desirable to store a summary of the data stream segments in order to perform data mining tasks. Since density estimation provides a comprehensive overview of the probabilistic data distribution of a stream segment, it is a natural choice for this purpose. In practice, however, using density distributions directly can be inefficient in both storage and processing. In this paper, we introduce the concept of cluster histograms, which provide an efficient way to estimate and summarize the most important data distribution profiles over different stream segments. These profiles can be constructed in a supervised or unsupervised way, depending on the nature of the underlying application. The profiles can also be used for change detection, anomaly detection, segmental nearest neighbor search, or supervised stream segment classification. Furthermore, these techniques can be used to model other kinds of data, such as text and categorical data. The flexibility of the tasks that the cluster histogram framework supports follows from its generality in storing the historical density profile of the data stream. As a result, this method provides a holistic framework for density-based mining of data streams. We discuss and test the application of the cluster histogram framework to a variety of interesting data mining applications.
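The abstract does not spell out how a cluster histogram is built, so the following is only an illustrative sketch of the general idea, not the paper's algorithm. It assumes a fixed set of cluster centroids, assigns each point of a stream segment to its nearest centroid, and keeps the relative frequencies as the segment's profile; an L1 distance between two profiles then serves as a simple stand-in for change detection. The names `cluster_histogram`, `nearest`, and `l1_distance`, the centroids, and the sample points are all hypothetical.

```python
from collections import Counter
import math


def nearest(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def cluster_histogram(segment, centroids):
    """Summarize one stream segment as relative frequencies over the centroids.

    Assumption: centroids are given up front; how they are chosen or
    maintained over the stream is outside this sketch.
    """
    if not segment:
        raise ValueError("segment must contain at least one point")
    counts = Counter(nearest(p, centroids) for p in segment)
    n = len(segment)
    return [counts[i] / n for i in range(len(centroids))]


def l1_distance(h1, h2):
    """L1 distance between two histograms defined over the same centroids."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


# Hypothetical two-dimensional data: two centroids, two consecutive segments.
centroids = [(0.0, 0.0), (10.0, 10.0)]
seg_a = [(0.1, 0.2), (0.3, -0.1), (9.8, 10.1), (0.0, 0.4)]
seg_b = [(10.2, 9.9), (9.7, 10.3), (0.2, 0.1), (10.0, 10.0)]

hist_a = cluster_histogram(seg_a, centroids)
hist_b = cluster_histogram(seg_b, centroids)
shift = l1_distance(hist_a, hist_b)  # larger values suggest a distribution change
```

Each stored profile needs only one number per centroid, which is the sense in which such a summary is cheaper to keep than a full density estimate of the segment.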
Pages: 1-29
Number of pages: 29