Fine-grained probability counting for cardinality estimation of data streams

被引:0
作者
Lun Wang
Tong Yang
Hao Wang
Jie Jiang
Zekun Cai
Bin Cui
Xiaoming Li
机构
[1] Peking University,Department of Computer Science
来源
World Wide Web | 2019年 / 22卷
关键词
Cardinality estimation; Probability counting; Network measurement; Data streams;
D O I
暂无
中图分类号
学科分类号
摘要
Estimating the number of distinct flows, also called the cardinality, is an important issue in many network applications, such as traffic measurement, anomaly detection, etc. The challenge is that high accuracy should be achieved with line speed and small auxiliary memory. Flajolet-Martin algorithm, LogLog algorithm, and HyperLogLog algorithm form a line of work in this area with improving performance. In this paper, we propose refined versions of these algorithms to achieve higher accuracy. The key observations are (1) the “leftmost” hash functions used by these algorithms can be generalized to reach higher accuracy, (2) the amendment coefficient can be highly biased in some certain streams or datasets so dynamically setting the amendment coefficient instead of using the one derived in pure math can lead to much better accuracy. Experimental results show great improvement of accuracy and stability of the refined versions over original algorithms.
引用
收藏
页码:2065 / 2081
页数:16
相关论文
共 36 条
[1]  
Dai H(2016)Finding persistent items in data streams Proc. VLDB Endow. 10 289-300
[2]  
Shahzad M(1990)On adaptive sampling Computing 43 391-400
[3]  
Liu AX(1985)Probabilistic counting algorithms for data base applications J. Comput. Syst. Sci. 31 182-209
[4]  
Zhong Y(2007)Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm Anal. Algor. 2007 127-146
[5]  
Flajolet P(2011)Hadi: Mining radii of large graphs ACM Trans. Knowl. Discov. Data (TKDD) 5 8-242
[6]  
Flajolet P(2008)Scalable data dissemination for inter-vehicle-communication: aggregation versus peer-to-peer (skalierbare informationsverbreitung für die fahrzeug-fahrzeug-kommunikation: Aggregation versus peer-to-peer) it-Information Technology 50 237-530
[7]  
Martin GN(2010)A probabilistic method for cooperative hierarchical aggregation of data in vanets Ad Hoc Netw. 8 518-1661
[8]  
Flajolet P(2012)Mining frequent itemsets over uncertain databases Proc. VLDB Endow 5 1650-712
[9]  
Fusy É(2015)Mining frequent itemsets in correlated uncertain databases J. Comput. Sci. Technol. 30 696-604
[10]  
Gandouet O(2016)Tracking frequent items over distributed probabilistic data World Wide Web 19 579-229