Adaptive Top-k Overlap Set Similarity Joins

被引：12

作者：

Yang, Zhong ^{[1
]}

Zheng, Bolong ^{[1
]}

Li, Guohui ^{[1
]}

Zhao, Xi ^{[1
]}

Zhou, Xiaofang ^{[2
]}

Jensen, Christian S. ^{[3
]}

机构：

[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China

[2] Univ Queensland, Brisbane, Qld, Australia

[3] Aalborg Univ, Aalborg, Denmark

来源：

2020 IEEE 36TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2020) | 2020年

关键词：

overlap set similarity; similarity join; top-k join; data lake; SEARCH;

D O I：

10.1109/ICDE48307.2020.00098

中图分类号：

TP [自动化技术、计算机技术];

学科分类号：

0812 ;

摘要：

The set similarity join (SSJ) is core functionality in a range of applications, including data cleaning, near-duplicate object detection, and data integration. Threshold-based SSJ queries return all pairs of sets with similarity no smaller than a given threshold. As results, and their utility, are very sensitive to the choice of threshold value, it is a problem that it is difficult to choose such an appropriate value. Doing so requires prior knowledge of the data, which users often do not have. To avoid this problem, we propose a solution to the top-k overlap set similarity join (TkOSSJ) that returns k pairs of sets with the highest overlap similarities. The state-of-the-art solution disregards the effect of the so-called step size, which is the number of elements accessed in each iteration of the algorithm. This affects its performance negatively. To address this issue, we first propose an algorithm that uses a fixed step size, thus taking advantage of the benefits of a large step size, and then we present an adaptive step size algorithm that is capable of automatically adjusting the step size, thus reducing redundant computations. An extensive empirical study offers insight into the new algorithms and indicates that they are capable of outperforming the state-of-the-art method on real, large-scale data sets.

引用

页码：1081 / 1092

页数：12

共 31 条

[1] [Anonymous], 2010, Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, DOI DOI 10.1145/1807167.1807222
[2] Arasu AGK06 Arvind, 2006, P 32 INT C VER LARG, P918
[3] Bayardo R.J., 2007, WWW, P131, DOI [DOI 10.1145/1242572.1242591, 10.1145/1242572.1242591]
[4] Bouros P, 2012, PROC VLDB ENDOW, V6, P1
[5] Syntactic clustering of the Web
Broder, AZ
Glassman, SC
Manasse, MS
Zweig, G
[J]. COMPUTER NETWORKS AND ISDN SYSTEMS, 1997, 29 (8-13): : 1157 - 1166
[6] Chaudhuri S., 2006, P 22 INT C DAT ENG I, P5, DOI DOI 10.1109/ICDE.2006.9
[7] Metric Similarity Joins Using MapReduce
Chen, Gang
Yang, Keyu
Chen, Lu
Gao, Yunjun
Zheng, Baihua
Chen, Chun
[J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2017, 29 (03) : 656 - 669
[8] Collection statistics for fast duplicate document detection
Chowdhury, A
Frieder, O
Grossman, D
McCabe, MC
[J]. ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2002, 20 (02) : 171 - 191
[9] Set Similarity Search Beyond MinHash
Christiani, Tobias
Pagh, Rasmus
[J]. STOC'17: PROCEEDINGS OF THE 49TH ANNUAL ACM SIGACT SYMPOSIUM ON THEORY OF COMPUTING, 2017, : 1094 - 1107
[10] Cohen W. W., 1998, SIGMOD Record, V27, P201, DOI 10.1145/276305.276323

← 1 2 3 4 →