A High-Performance Algorithm for Identifying Frequent Items in Data Streams

Cited by: 16
Authors
Anderson, Daniel [1]
Bevan, Pryce [1]
Lang, Kevin [2]
Liberty, Edo [3]
Rhodes, Lee [4]
Thaler, Justin [1]
Affiliations
[1] Georgetown Univ, Washington, DC 20057 USA
[2] Georgetown Univ, Washington, DC 20057 USA
[3] Amazon, Washington, DC 20057 USA
[4] Oath, Washington, DC 20057 USA
Source
PROCEEDINGS OF THE 2017 INTERNET MEASUREMENT CONFERENCE (IMC'17) | 2017
Keywords
streaming algorithms; mergeable summaries; frequent items; FINDING FREQUENT;
DOI
10.1145/3131365.3131407
Chinese Library Classification (CLC) number
TP301 [Theory and methods];
Subject classification code
081202;
Abstract
Estimating frequencies of items over data streams is a common building block in streaming data measurement and analysis. Misra and Gries introduced their seminal algorithm for the problem in 1982, and the problem has since been revisited many times due to its practicality and applicability. We describe a highly optimized version of Misra and Gries' algorithm that is suitable for deployment in industrial settings. Our code is made public via an open source library called Data Sketches that is already used by several companies and production systems. Our algorithm improves on two theoretical and practical aspects of prior work. First, it handles weighted updates in amortized constant time, a common requirement in practice. Second, it uses a simple and fast method for merging summaries that asymptotically improves on prior work even for unweighted streams. We describe experiments confirming that our algorithms are more efficient than prior proposals.
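For orientation, the kind of summary the abstract refers to can be sketched as follows. This is a minimal textbook-style Misra-Gries summary with weighted updates and merging, not the optimized algorithm of the paper and not the Data Sketches API; the class name `MisraGries`, its methods, and the parameter `k` are hypothetical names chosen for illustration. The shrink step here sorts the counters, so it does not achieve the amortized constant update time claimed in the abstract.

```python
class MisraGries:
    """Illustrative Misra-Gries summary that keeps at most k counters.

    Hypothetical sketch: estimates never exceed the true weight, and
    undercount by at most (total weight) / (k + 1).
    """

    def __init__(self, k):
        self.k = k
        self.counters = {}

    def update(self, item, weight=1):
        # Add the (positive) weight, then shrink if over capacity.
        self.counters[item] = self.counters.get(item, 0) + weight
        if len(self.counters) > self.k:
            self._shrink()

    def _shrink(self):
        # Subtract the (k+1)-th largest count from every counter and
        # drop those that fall to zero or below; at most k survive.
        cut = sorted(self.counters.values(), reverse=True)[self.k]
        self.counters = {
            item: count - cut
            for item, count in self.counters.items()
            if count > cut
        }

    def merge(self, other):
        # Add the other summary's counters, then shrink back to k.
        for item, count in other.counters.items():
            self.counters[item] = self.counters.get(item, 0) + count
        if len(self.counters) > self.k:
            self._shrink()

    def estimate(self, item):
        # Lower bound on the item's true total weight.
        return self.counters.get(item, 0)


left = MisraGries(k=2)
for item, weight in [("a", 5), ("b", 3), ("c", 1)]:
    left.update(item, weight)

right = MisraGries(k=2)
for item, weight in [("a", 2), ("d", 4)]:
    right.update(item, weight)

left.merge(right)
print(left.estimate("a"))
```

In this example the total stream weight is 15 and k is 2, so each estimate may undercount by up to 5; the heavy item "a" (true weight 7) remains in the merged summary.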
Pages: 268-282
Page count: 15
Related papers
48 in total
[1]   Mergeable Summaries [J].
Agarwal, Pankaj K. ;
Cormode, Graham ;
Huang, Zengfeng ;
Phillips, Jeff M. ;
Wei, Zhewei ;
Yi, Ke .
ACM TRANSACTIONS ON DATABASE SYSTEMS, 2013, 38 (04)
[2]  
Ailon N., 2013, Proceedings of the 6th ACM International Conference on Web Search and Data Mining, WSDM, P405
[3]  
[Anonymous], 2003, P 3 ACM SIGCOMM INTE, DOI 10.1145/948205.948227
[4]  
[Anonymous], PROC ICDT
[5]  
[Anonymous], 2016, The CAIDA UCSD Anonymized Internet Traces
[6]  
[Anonymous], 2016, P 17 ITALIAN C THEOR
[7]  
[Anonymous], 2004, ACM SIGMOD
[8]  
BASAT R. B., 2017, ACM SIGCOMM 2017
[9]  
Basat Ran Ben, 2017, IEEE INFOCOM 2017
[10]   Space-Optimal Heavy Hitters with Strong Error Bounds [J].
Berinde, Radu ;
Indyk, Piotr ;
Cormode, Graham ;
Strauss, Martin J. .
ACM TRANSACTIONS ON DATABASE SYSTEMS, 2010, 35 (04)