Distribution-free data density estimation in large-scale networks

被引:0
作者
Minqi Zhou
Rong Zhang
Weining Qian
Aoying Zhou
机构
[1] East China Normal University,Data Science and Engineering Institute
[2] Wuhan University,State Key Lab of Software Engineering
来源
Frontiers of Computer Science | 2018年 / 12卷
关键词
distribution-free; data density estimation; random sampling;
D O I
暂无
中图分类号
学科分类号
摘要
Estimating the global data distribution in large-scale networks is an important issue and yet to be well addressed. It can benefit many applications, especially in the cloud computing era, such as load balancing analysis, query processing, and data mining. Inspired by the inversion method for random variate (number) generation, in this paper, we present a novel model called distribution-free data density estimation for large ring-based networks to achieve high estimation accuracy with low estimation cost regardless of the distribution models of the underlying data. This model generates random samples for any arbitrary distribution by sampling the global cumulative distribution function and is free from sampling bias. Armed with this estimation method, we can estimate data densities over both one-dimensional and multidimensional tuple sets, where each dimension could be either continuous or discrete as its domain. In large-scale networks, the key idea for distribution-free estimation is to sample a small subset of peers for estimating the global data distribution over the data domain. Algorithms on computing and sampling the global cumulative distribution function based on which the global data distribution is estimated are introduced with a detailed theoretical analysis. Our extensive performance study confirms the effectiveness and efficiency of our methods in large ring-based networks.
引用
收藏
页码:1220 / 1240
页数:20
相关论文
共 54 条
  • [1] Lakshman A(2010)Cassandra: a decentralized structured storage system ACM SIGOPS Operating Systems Review 44 35-40
  • [2] Malik P(2007)Dynamo: amazon’s highly available key-value store ACM SIGOPS Review 41 205-220
  • [3] DeCandia G(2006)Distributed data mining in peer-to-peer networks IEEE Internet Computing 10 18-26
  • [4] Hastorun D(1998)Wavelet-based histograms for selectivity estimation ACM SIGMoD Record 27 448-459
  • [5] Jampani M(1987)Stan ulam, john von neumann, and the monte carlo method Los Alamos Science 15 131-137
  • [6] Kakulapati G(1983)Estimating record selectivities Information Systems 8 105-115
  • [7] Lakshman A(2004)Mercury: supporting scalable multi-attribute range queries ACM SIGCOMM Computer Communication Review 34 353-366
  • [8] Pilchin A(2007)Sampling cluster endurance for peer-to-peer based content distribution networks Multimedia Systems 13 19-33
  • [9] Sivasubramanian S(2015)Stable distributed P2P protocols based on random peer sampling IEEE/ACM Transactions on Networking 23 1444-1456
  • [10] Vosshall P(2007)Gossip-based peer sampling ACM Transactions on Computer Systems 25 8-160