Efficient computation of frequent and top-k elements in data streams

被引:404
作者
Metwally, A [1 ]
Agrawal, D [1 ]
El Abbadi, A [1 ]
机构
[1] Univ Calif Santa Barbara, Dept Comp Sci, Santa Barbara, CA 93106 USA
来源
DATABASE THEORY - ICDT 2005, PROCEEDINGS | 2005年 / 3363卷
关键词
D O I
10.1007/978-3-540-30570-5_27
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
We propose an integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream. Our technique is efficient and exact if the alphabet under consideration is small. In the more practical large alphabet case, our solution is space efficient and reports both top-k and frequent elements with tight guarantees on errors. For general data distributions, our top-k algorithm can return a set of k ' elements, where k ' approximate to k, which are guaranteed to be the top-k ' elements; and we use minimal space for calculating frequent elements. For realistic Zipfian data, our space requirement for the frequent elements problem decreases dramatically with the parameter of the distribution; and for top-k queries, we ensure that only the top-k elements, in the correct order, are reported. Our experiments show significant space reductions with no loss in accuracy.
引用
收藏
页码:398 / 412
页数:15
相关论文
共 16 条
[1]  
[Anonymous], 1949, Human behaviour and the principle of least-effort
[2]  
BOSE P, 2003, P 10 INT C STRUCT IN, P33
[3]  
BOYER R, 1981, 198132 U TEX I COMP
[4]  
Charikar M., 2002, P 29 INT C AUT LANG, P693, DOI 10.1007/3-540-45465-9_59
[5]  
Cormode G., 2003, ACM Transactions on Database Systems (TODS), P296, DOI DOI 10.1145/1061318.1061325
[6]  
DEMAINE ED, 2002, P 10 ANN EUR S ALG, P348
[7]   New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice [J].
Estan, C ;
Varghese, G .
ACM TRANSACTIONS ON COMPUTER SYSTEMS, 2003, 21 (03) :270-313
[8]  
Fischer M.J., 1982, J ALGORITHMS, V3, P376
[9]  
Hoare C. A. R., 1961, COMMUN ACM, V4, P321, DOI DOI 10.1145/366622.366644
[10]  
JIN C, 2003, P 12 INT C INF KNOWL, P287