Computing Hierarchical Summary of the Data Streams

被引:3
作者
Shah, Zubair [1 ]
Mahmood, Abdun Naser [1 ]
Barlow, Michael [1 ]
机构
[1] Univ New South Wales, Canberra, ACT, Australia
来源
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PAKDD 2016, PT II | 2016年 / 9652卷
关键词
Hierarchical heavy hitters; Network traffic summarization; Heavy hitters; Data streams; FREQUENT;
D O I
10.1007/978-3-319-31750-2_14
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Data stream processing is an important function in many online applications such as network traffic analysis, web applications, and financial data analysis. Computing summaries of data stream is challenging since streaming data is generally unbounded, and cannot be permanently stored or accessed more than once. In this paper, we have proposed two counter based hierarchical (CHS) is an element of-approximation algorithms to create hierarchical summaries of one dimensional data. CHS maintains a data structure, where each entry contains the incoming data item and an associated counter to store its frequency. Since every item in streaming data cannot be stored, CHS only maintains frequent items (known as hierarchical heavy hitters) at various levels of generalization hierarchy by exploiting the natural hierarchy of the data. The algorithm guarantees accuracy of count within an is an element of bound. Furthermore, using aperiodic (CHS-A) and periodic (CHS-P) compression strategy the proposed technique offers improved space complexities of O(eta/is an element of) and O(eta/is an element of log is an element of N), respectively. We provide theoretical proofs for both space and time requirements of CHS algorithm. We have also experimentally compared the proposed algorithm with the existing benchmark techniques. Experimental results show that the proposed algorithm requires fewer updates per element of data, and uses a moderate amount of bounded memory. Moreover, precision-recall analysis demonstrates that CHS algorithm provides a high quality output compared to existing benchmark techniques. For the experimental validation, we have used both synthetic data derived from an open source generator, and real benchmark data sets from an international Internet Service Provider.
引用
收藏
页码:168 / 179
页数:12
相关论文
共 17 条
  • [1] [Anonymous], 2013 I E LAT AM C CO
  • [2] [Anonymous], 2008, ACM Trans. Knowl. Discov., DOI DOI 10.1145/1324172.1324174DATA
  • [3] [Anonymous], ACM SIGCOMM
  • [4] Charikar M, 2002, LECT NOTES COMPUT SC, V2380, P693
  • [5] New directions in traffic measurement and accounting
    Estan, C
    Varghese, G
    [J]. ACM SIGCOMM COMPUTER COMMUNICATION REVIEW, 2002, 32 (04) : 323 - 336
  • [6] Exploiting hierarchical domain structure to compute similarity
    Ganesan, P
    Garcia-Molina, H
    Widom, J
    [J]. ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2003, 21 (01) : 64 - 93
  • [7] Hernandez C., 2014, THESIS
  • [8] Hershberger J., 2005, P 24 ACM SIGMOD SIGA, P338, DOI DOI 10.1145/1065167.1065211
  • [9] Jose L, 2011, P USENIX HOTICE WORK
  • [10] Kalliola Aapo, 2014, Secure IT Systems 19th Nordic Conference, NordSec 2014. Proceedings: LNCS 8788, P213, DOI 10.1007/978-3-319-11599-3_13