Set Similarity Joins on MapReduce: An Experimental Survey

被引:32
作者
Fier, Fabian [1 ]
Augsten, Nikolaus [2 ]
Bouros, Panagiotis [3 ]
Leser, Ulf [1 ]
Freytag, Johann-Christoph [1 ]
机构
[1] Humboldt Univ, Unter Linden 6, D-10099 Berlin, Germany
[2] Univ Salzburg, Jakob Haringer Str 2, A-5020 Salzburg, Austria
[3] Johannes Gutenberg Univ Mainz, Saarstr 21, D-55122 Mainz, Germany
来源
PROCEEDINGS OF THE VLDB ENDOWMENT | 2018年 / 11卷 / 10期
基金
奥地利科学基金会;
关键词
PARTITION-BASED METHOD; EFFICIENT;
D O I
10.14778/3231751.3231760
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Set similarity joins, which compute pairs of similar sets, constitute an important operator primitive in a variety of applications, including applications that must process large amounts of data. To handle these data volumes, several distributed set similarity join algorithms have been proposed. Unfortunately, little is known about the relative performance, strengths and weaknesses of these techniques. Previous comparisons are limited to a small subset of relevant algorithms, and the large differences in the various test setups make it hard to draw overall conclusions. In this paper we survey ten recent, distributed set similarity join algorithms, all based on the MapReduce paradigm. We empirically compare the algorithms in a uniform test environment on twelve datasets that expose different characteristics and represent a broad range of applications. Our experiments yield a surprising result: All algorithms in our test fail to scale for at least one dataset and are sensitive to long sets, frequent set elements, low similarity thresholds, or a combination thereof. Interestingly, some algorithms even fail to handle the small datasets that can easily be processed in a non-distributed setting. Our analytic investigation of the algorithms pinpoints the reasons for the poor performance and targeted experiments confirm our analytic findings. Based on our investigation, we suggest directions for future research in the area.
引用
收藏
页码:1110 / 1122
页数:13
相关论文
共 18 条
  • [1] Overlap Set Similarity Joins with Theoretical Guarantees
    Deng, Dong
    Tao, Yufei
    Li, Guoliang
    SIGMOD'18: PROCEEDINGS OF THE 2018 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2018, : 905 - 920
  • [2] A Two -Level Signature Scheme for Stable Set Similarity Joins
    Schmitt, Daniel
    Kocher, Daniel
    Augsten, Nikolaus
    Mann, Willi
    Miller, Alexander
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2023, 16 (11): : 2686 - 2698
  • [3] Similarity Joins of Sparse Features
    Metwally, Ahmed
    Shum, Michael
    COMPANION OF THE 2024 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, SIGMOD-COMPANION 2024, 2024, : 80 - 92
  • [4] Scalable Similarity Joins of Tokenized Strings
    Metwally, Ahmed
    Huang, Chun-Heng
    2019 IEEE 35TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2019), 2019, : 1766 - 1777
  • [5] Projection Based Large Scale High-Dimensional Data Similarity Join Using MapReduce Framework
    Ma, Youzhong
    Zhang, Ruiling
    Cui, Zhanyou
    Lin, Chunjie
    IEEE ACCESS, 2020, 8 : 121665 - 121677
  • [6] An Experimental Study on Federated Equi-Joins
    Li, Shuyuan
    Zeng, Yuxiang
    Wang, Yuxiang
    Zhong, Yiman
    Zhou, Zimu
    Tong, Yongxin
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (09) : 4443 - 4457
  • [7] A Survey on Parallel Join Algorithms Using MapReduce on Hadoop
    Barhoush, Malek Mahmoud
    AlSobeh, Anas Mohammad
    Al Rawashdeh, Ahmad
    2019 IEEE JORDAN INTERNATIONAL JOINT CONFERENCE ON ELECTRICAL ENGINEERING AND INFORMATION TECHNOLOGY (JEEIT), 2019, : 381 - 388
  • [8] A Trie Based Set Similarity Query Algorithm
    Jia, Lianyin
    Tang, Junzhuo
    Li, Mengjuan
    Li, Runxin
    Ding, Jiaman
    Chen, Yinong
    MATHEMATICS, 2023, 11 (01)
  • [9] Subgraph Matching with Set Similarity in a Large Graph Database
    Hong, Liang
    Zou, Lei
    Lian, Xiang
    Yu, Philip S.
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2015, 27 (09) : 2507 - 2521
  • [10] A Transformation-Based Framework for KNN Set Similarity Search
    Zhang, Yong
    Wu, Jiacheng
    Wang, Jin
    Xing, Chunxiao
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2020, 32 (03) : 409 - 423