Maintaining stream statistics over sliding windows

被引:274
作者
Datar, M [1 ]
Gionis, A
Indyk, P
Motwani, R
机构
[1] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
[2] MIT, Comp Sci Lab, Cambridge, MA 02139 USA
关键词
statistics; data streams; sliding windows; approximation algorithms;
D O I
10.1137/S0097539701398363
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
We consider the problem of maintaining aggregates and statistics over data streams, with respect to the last N data elements seen so far. We refer to this model as the sliding window model. We consider the following basic problem: Given a stream of bits, maintain a count of the number of 1 s in the last N elements seen from the stream. We show that, using O(1/epsilon log(2) N) bits of memory, we can estimate the number of 1 s to within a factor of 1 + epsilon. We also give a matching lower bound of Omega(1/epsilon log(2) N) memory bits for any deterministic or randomized algorithms. We extend our scheme to maintain the sum of the last N positive integers and provide matching upper and lower bounds for this more general problem as well. We also show how to efficiently compute the L-p norms (p is an element of[1, 2]) of vectors in the sliding window model using our techniques. Using our algorithm, one can adapt many other techniques to work for the sliding window model with a multiplicative overhead of O(1/epsilon log N) in memory and a 1 + epsilon factor loss in accuracy. These include maintaining approximate histograms, hash tables, and statistics or aggregates such as sum and averages.
引用
收藏
页码:1794 / 1813
页数:20
相关论文
共 16 条
[1]  
Alon N., 1996, Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, P20, DOI 10.1145/237814.237823
[2]  
[Anonymous], P 1998 INT C VER LAR
[3]  
Chaudhuri S, 1999, SIGMOD RECORD, VOL 28, NO 2 - JUNE 1999, P263, DOI 10.1145/304181.304206
[4]  
*CISC SYST, 2000, NETFL SERV APPL
[5]  
Cortes C., 2000, Proceedings. KDD-2000. Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, P9, DOI 10.1145/347090.347094
[6]  
FEIGENBAUM J, 1999, P 40 ANN S FDN COMP, P501
[7]  
FRALEIGH C, 2000, TR00ATL101801 SPRINT
[8]  
Gilbert A. C., 2001, Proceedings of the 27th International Conference on Very Large Data Bases, P79
[9]   Approximating a data stream for querying and estimation: Algorithms and performance evaluation [J].
Guha, S ;
Koudas, N .
18TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 2002, :567-576
[10]   Clustering data streams [J].
Guha, S ;
Mishra, N ;
Motwani, R ;
O'Callaghan, L .
41ST ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, PROCEEDINGS, 2000, :359-366