Set Similarity Joins on MapReduce: An Experimental Survey

被引：33

作者：

Fier, Fabian ^{[1
]}

Augsten, Nikolaus ^{[2
]}

Bouros, Panagiotis ^{[3
]}

Leser, Ulf ^{[1
]}

Freytag, Johann-Christoph ^{[1
]}

机构：

[1] Humboldt Univ, Unter Linden 6, D-10099 Berlin, Germany

[2] Univ Salzburg, Jakob Haringer Str 2, A-5020 Salzburg, Austria

[3] Johannes Gutenberg Univ Mainz, Saarstr 21, D-55122 Mainz, Germany

来源：

PROCEEDINGS OF THE VLDB ENDOWMENT | 2018年 / 11卷 / 10期

基金：

奥地利科学基金会;

关键词：

PARTITION-BASED METHOD; EFFICIENT;

D O I：

10.14778/3231751.3231760

中图分类号：

TP [自动化技术、计算机技术];

学科分类号：

0812 ;

摘要：

Set similarity joins, which compute pairs of similar sets, constitute an important operator primitive in a variety of applications, including applications that must process large amounts of data. To handle these data volumes, several distributed set similarity join algorithms have been proposed. Unfortunately, little is known about the relative performance, strengths and weaknesses of these techniques. Previous comparisons are limited to a small subset of relevant algorithms, and the large differences in the various test setups make it hard to draw overall conclusions. In this paper we survey ten recent, distributed set similarity join algorithms, all based on the MapReduce paradigm. We empirically compare the algorithms in a uniform test environment on twelve datasets that expose different characteristics and represent a broad range of applications. Our experiments yield a surprising result: All algorithms in our test fail to scale for at least one dataset and are sensitive to long sets, frequent set elements, low similarity thresholds, or a combination thereof. Interestingly, some algorithms even fail to handle the small datasets that can easily be processed in a non-distributed setting. Our analytic investigation of the algorithms pinpoints the reasons for the poor performance and targeted experiments confirm our analytic findings. Based on our investigation, we suggest directions for future research in the area.

引用

页码：1110 / 1122

页数：13

共 33 条

[1] Fuzzy Joins Using MapReduce [J].

Afrati, Foto N. ;

Das Sarma, Anish ;

Menestrina, David ;

Parameswaran, Aditya ;

Ullman, Jeffrey D. .

2012 IEEE 28TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2012, :498-509

[2]

[Anonymous], 2007, WWW INT C WORLD WID, DOI [10.1145/1242572.1242591, DOI 10.1145/1242572.1242591]

[3]

[Anonymous], 2006, ICDE, DOI [DOI 10.1109/ICDE.2006.9, 10.1109/ICDE.2006.9]

[4]

[Anonymous], 2007, WWW INT C WORLD WID

[5]

Arasu A., 2006, Proceedings of VLDB Endowment, P918

[6] On-the-Fly Token Similarity Joins in Relational Databases [J].

Augsten, Nikolaus ;

Miraglia, Armando ;

Neumann, Thomas ;

Kemper, Alfons .

SIGMOD'14: PROCEEDINGS OF THE 2014 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2014, :1495-1506

[7] The pq-Gram Distance between Ordered Labeled Trees [J].

Augsten, Nikolaus ;

Boehlen, Michael ;

Gamper, Johann .

ACM TRANSACTIONS ON DATABASE SYSTEMS, 2010, 35 (01)

[8]

Baraglia Ranieri, 2010, Proceedings 2010 10th IEEE International Conference on Data Mining (ICDM 2010), P731, DOI 10.1109/ICDM.2010.70

[9] ClusterJoin: A Similarity Joins Framework using Map-Reduce [J].

Das Sarma, Akash ;

He, Yeye ;

Chaudhuri, Surajit .

PROCEEDINGS OF THE VLDB ENDOWMENT, 2014, 7 (12) :1059-1070

[10]

Dean J, 2004, USENIX ASSOCIATION PROCEEDINGS OF THE SIXTH SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDE '04), P137

← 1 2 3 4 →