Finding Frequent Items in Data Streams

被引:159
作者
Cormode, Graham [1 ]
Hadjieleftheriou, Marios [1 ]
机构
[1] AT&T Labs Res, Florham Pk, NJ 07932 USA
来源
PROCEEDINGS OF THE VLDB ENDOWMENT | 2008年 / 1卷 / 02期
关键词
D O I
10.14778/1454159.1454225
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large scale industrial systems. However, there has not been much comparison of the different methods under uniform experimental conditions. It is common to find papers touching on this topic in which important related work is mischaracterized, overlooked, or reinvented. In this paper, we aim to present the most important algorithms for this problem in a common framework. We have created baseline implementations of the algorithms, and used these to perform a thorough experimental study of their properties. We give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.
引用
收藏
页码:1530 / 1541
页数:12
相关论文
共 39 条
[1]  
Alon N., 1996, Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, P20, DOI 10.1145/237814.237823
[2]  
Arasu A., 2004, ACM PODS
[3]  
Bandi N., 2007, ACM SIGMOD
[4]  
Bhattacharrya S., 2007, SCAL STREAM PROC SYS
[5]  
Bhuvanagiri L., 2006, ACM SIAM S DISCR ALG
[6]  
Blum A., 2004, IRPT0423 INT RES
[7]  
BOSE P, 2003, SIROCCO
[8]  
Boyer B., 1981, ICSCACMP32 U TEX
[9]  
Boyer R. S., 1991, AUTOMATED REASONING, P105, DOI DOI 10.1007/978-94-011-3488-0_5
[10]  
Chakrabarti A., 2007, ACM SIAM S DISCR ALG