Distributed Streaming Set Similarity Join

被引:9
作者
Yang, Jianye [1 ]
Zhang, Wenjie [2 ]
Wang, Xiang [2 ]
Zhang, Ying [3 ]
Lin, Xuemin [2 ]
机构
[1] Hunan Univ, Changsha, Peoples R China
[2] Univ New South Wales, Sydney, NSW, Australia
[3] Univ Technol Sydney, CAI, Sydney, NSW, Australia
来源
2020 IEEE 36TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2020) | 2020年
关键词
MAPREDUCE;
D O I
10.1109/ICDE48307.2020.00055
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
With the prevalence of Internet access and user generated content, a large number of documents/records, such as news and web pages, have been continuously generated in an unprecedented manner. In this paper, we study the problem of efficient stream set similarity join over distributed systems, which has broad applications in data cleaning and data integration tasks, such as on-line near-duplicate detection. In contrast to prefix-based distribution strategy which is widely adopted in offline distributed processing, we propose a simple yet efficient length-based distribution framework which dispatches incoming records by their length. A load-aware length partition method is developed to find a balanced partition by effectively estimating local join cost to achieve good load balance. Our length-based scheme is surprisingly superior to its competitors since it has no replication, small communication cost, and high throughput. We further observe that the join results from the current incoming record can be utilized to guide the index construction, which in turn can facilitate the join processing of future records. Inspired by this observation, we propose a novel bundle-based join algorithm by grouping similar records on-the-fly to reduce filtering cost. A by-product of this algorithm is an efficient verification technique, which verifies a batch of records by utilizing their token differences to share verification costs, rather than verifying them individually. Extensive experiments conducted on Storm, a popular distributed stream processing system, suggest that our methods can achieve up to one order of magnitude throughput improvement over baselines.
引用
收藏
页码:565 / 576
页数:12
相关论文
共 36 条
  • [1] Fuzzy Joins Using MapReduce
    Afrati, Foto N.
    Das Sarma, Anish
    Menestrina, David
    Parameswaran, Aditya
    Ullman, Jeffrey D.
    [J]. 2012 IEEE 28TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2012, : 498 - 509
  • [2] [Anonymous], 2012, SIGMOD
  • [3] [Anonymous], 2006, ICDE
  • [4] [Anonymous], 2006, Proceedings of the 32Nd International Conference on Very Large Data Bases
  • [5] [Anonymous], 1997, Information theory and statistics
  • [6] Bayardo R.J., 2007, Proceedings of the 16th international conference on World Wide Web, P131, DOI [10.1145/1242572.1242591, DOI 10.1145/1242572.1242591]
  • [7] Bouros P, 2012, PROC VLDB ENDOW, V6, P1
  • [8] Min-wise independent permutations
    Broder, AZ
    Charikar, M
    Frieze, AM
    Mitzenmacher, M
    [J]. JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 2000, 60 (03) : 630 - 659
  • [9] Dean J, 2004, USENIX ASSOCIATION PROCEEDINGS OF THE SIXTH SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDE '04), P137
  • [10] Overlap Set Similarity Joins with Theoretical Guarantees
    Deng, Dong
    Tao, Yufei
    Li, Guoliang
    [J]. SIGMOD'18: PROCEEDINGS OF THE 2018 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2018, : 905 - 920