Mergeable Summaries

被引:99
作者
Agarwal, Pankaj K. [1 ]
Cormode, Graham [2 ,3 ]
Huang, Zengfeng [4 ,7 ]
Phillips, Jeff M. [5 ]
Wei, Zhewei [4 ,7 ]
Yi, Ke [6 ,7 ]
机构
[1] Duke Univ, Durham, NC USA
[2] Univ Warwick, Coventry CV4 7AL, W Midlands, England
[3] AT&T Labs Res, Middletown, NJ 07748 USA
[4] Aarhus Univ, Aarhus, Denmark
[5] Univ Utah, Salt Lake City, UT USA
[6] Tsinghua Univ, Beijing 100084, Peoples R China
[7] HKUST, Hong Kong, Hong Kong, Peoples R China
来源
ACM TRANSACTIONS ON DATABASE SYSTEMS | 2013年 / 38卷 / 04期
基金
美国国家科学基金会;
关键词
Algorithms; Data summarization; heavy hitters; quantiles; BOUNDS; COMPLEXITY; FREQUENT;
D O I
10.1145/2500128
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two datasets, there is a way to merge the two summaries into a single summary on the two datasets combined together, while preserving the error and size guarantees. This property means that the summaries can be merged in a way akin to other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the datasets. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this article, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriatemodifications. Specifically, we show that for epsilon-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/epsilon); for epsilon-approximate quantiles, there is a deterministic summary of size O((1/epsilon) log(epsilon n)) that has a restricted form of mergeability, and a randomized one of size O((1/epsilon) log(3/2)(1/epsilon)) with full mergeability. We also extend our results to geometric summaries such as epsilon-approximations which permit approximate multidimensional range counting queries. While most of the results in this article are theoretical in nature, some of the algorithms are actually very simple and even perform better than the previously best known algorithms, which we demonstrate through experiments in a simulated sensor network. We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for epsilon-approximate quantiles that depends only on epsilon, of size O((1/epsilon) log(3/2)(1/epsilon)), and (2) we demonstrate that the MG and the SpaceSaving summaries for heavy hitters are isomorphic.
引用
收藏
页数:28
相关论文
共 40 条
[1]  
Ahn K.J., 2012, P ACM SIAM S DISCR A
[2]   The space complexity of approximating the frequency moments [J].
Alon, N ;
Matias, Y ;
Szegedy, M .
JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 1999, 58 (01) :137-147
[3]  
[Anonymous], 2004, Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems (SenSys), DOI DOI 10.1145/1031495.1031524
[4]  
[Anonymous], P ACM S PRINC DAT SY
[5]  
[Anonymous], 2012, Proceedings of the 31st symposium on Principles of Database Systems, DOI [10.1145/2213556.2213562, DOI 10.1145/2213556.2213562]
[6]   Semidefinite optimization in discrepancy theory [J].
Bansal, Nikhil .
MATHEMATICAL PROGRAMMING, 2012, 134 (01) :5-22
[7]   Constructive Algorithms for Discrepancy Minimization [J].
Bansal, Nikhil .
2010 IEEE 51ST ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, 2010, :3-10
[8]  
Bar-Yossef Z., 2002, Randomization and Approximation Techniques in Computer Science. 6th International Workshop, RANDOM 2002. Proceedings (Lecture Notes in Computer Science Vol.2483), P1
[9]   Space-Optimal Heavy Hitters with Strong Error Bounds [J].
Berinde, Radu ;
Indyk, Piotr ;
Cormode, Graham ;
Strauss, Martin J. .
ACM TRANSACTIONS ON DATABASE SYSTEMS, 2010, 35 (04)
[10]   On linear-time deterministic algorithms for optimization problems in fixed dimension [J].
Chazelle, B ;
Matousek, J .
JOURNAL OF ALGORITHMS-COGNITION INFORMATICS AND LOGIC, 1996, 21 (03) :579-597