Adaptive Top-k Overlap Set Similarity Joins

被引:12
作者
Yang, Zhong [1 ]
Zheng, Bolong [1 ]
Li, Guohui [1 ]
Zhao, Xi [1 ]
Zhou, Xiaofang [2 ]
Jensen, Christian S. [3 ]
机构
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[2] Univ Queensland, Brisbane, Qld, Australia
[3] Aalborg Univ, Aalborg, Denmark
来源
2020 IEEE 36TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2020) | 2020年
关键词
overlap set similarity; similarity join; top-k join; data lake; SEARCH;
D O I
10.1109/ICDE48307.2020.00098
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
The set similarity join (SSJ) is core functionality in a range of applications, including data cleaning, near-duplicate object detection, and data integration. Threshold-based SSJ queries return all pairs of sets with similarity no smaller than a given threshold. As results, and their utility, are very sensitive to the choice of threshold value, it is a problem that it is difficult to choose such an appropriate value. Doing so requires prior knowledge of the data, which users often do not have. To avoid this problem, we propose a solution to the top-k overlap set similarity join (TkOSSJ) that returns k pairs of sets with the highest overlap similarities. The state-of-the-art solution disregards the effect of the so-called step size, which is the number of elements accessed in each iteration of the algorithm. This affects its performance negatively. To address this issue, we first propose an algorithm that uses a fixed step size, thus taking advantage of the benefits of a large step size, and then we present an adaptive step size algorithm that is capable of automatically adjusting the step size, thus reducing redundant computations. An extensive empirical study offers insight into the new algorithms and indicates that they are capable of outperforming the state-of-the-art method on real, large-scale data sets.
引用
收藏
页码:1081 / 1092
页数:12
相关论文
共 31 条
  • [1] [Anonymous], 2010, Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, DOI DOI 10.1145/1807167.1807222
  • [2] Arasu AGK06 Arvind, 2006, P 32 INT C VER LARG, P918
  • [3] Bayardo R.J., 2007, WWW, P131, DOI [DOI 10.1145/1242572.1242591, 10.1145/1242572.1242591]
  • [4] Bouros P, 2012, PROC VLDB ENDOW, V6, P1
  • [5] Syntactic clustering of the Web
    Broder, AZ
    Glassman, SC
    Manasse, MS
    Zweig, G
    [J]. COMPUTER NETWORKS AND ISDN SYSTEMS, 1997, 29 (8-13): : 1157 - 1166
  • [6] Chaudhuri S., 2006, P 22 INT C DAT ENG I, P5, DOI DOI 10.1109/ICDE.2006.9
  • [7] Metric Similarity Joins Using MapReduce
    Chen, Gang
    Yang, Keyu
    Chen, Lu
    Gao, Yunjun
    Zheng, Baihua
    Chen, Chun
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2017, 29 (03) : 656 - 669
  • [8] Collection statistics for fast duplicate document detection
    Chowdhury, A
    Frieder, O
    Grossman, D
    McCabe, MC
    [J]. ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2002, 20 (02) : 171 - 191
  • [9] Set Similarity Search Beyond MinHash
    Christiani, Tobias
    Pagh, Rasmus
    [J]. STOC'17: PROCEEDINGS OF THE 49TH ANNUAL ACM SIGACT SYMPOSIUM ON THEORY OF COMPUTING, 2017, : 1094 - 1107
  • [10] Cohen W. W., 1998, SIGMOD Record, V27, P201, DOI 10.1145/276305.276323