A Theoretical and Experimental Comparison of Large-Scale Join Algorithms in Spark

Cited by: 0

Authors:

Phan A.-C. [1]

Phan T.-C. [2]

Trieu T.-N. [2]

Tran T.-T.-Q. [3]

Affiliations:

[1] Vinh Long University of Technology Education, Vinh Long City

[2] Can Tho University, Can Tho City

[3] IRISA, University of Rennes 1, Lannion City

Source:

SN Computer Science | 2021, Volume 2, Issue 5

Keywords:

Big data analytic; Bloom filter; Join operation; MapReduce; Spark;

DOI:

10.1007/s42979-021-00738-x

Abstract:

Currently, the estimated amount of data created daily has reached the scale of petabytes or even zettabytes globally. It is no wonder that traditional data processing technologies cannot process and manage such extremely large volumes of data. However, these massive and varied data can be used to address business problems that could not be tackled before. To discover their value, query operations must be performed effectively in a parallel and distributed manner. One of the standard and most common query operations is the join, which is also an expensive one. This research systematically presents a theoretical and experimental comparison of the prominent join algorithms in the Spark environment. First, the study details the important strategies for two-way joins and recursive joins. Then, it sets out the advantages and disadvantages of each approach. In particular, the work provides mathematical cost models to make a more convincing comparison of the joins before verifying it by experiments. The results show that the comparison based on the cost models is consistent with the one based on the experiments. In general, the two-way and recursive joins that use filters are the best choices when running in the Spark environment. © 2021, The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd.
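The abstract names Bloom-filter-based joins as the best performers but does not show the mechanism. The following is a minimal single-process sketch of the general idea, not the paper's algorithms and not Spark code: build a Bloom filter over the join keys of the smaller input, use it to discard rows of the larger input that cannot match, then do an exact join on the survivors. The names `BloomFilter`, `bloom_filtered_join`, `num_bits`, and `num_hashes` are hypothetical and chosen for illustration only. In a distributed setting the filter would presumably be shipped to the workers so that non-matching rows are dropped before any data is moved.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes positions by salting a SHA-256 digest with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def bloom_filtered_join(small, large, num_bits=1024, num_hashes=3):
    """Equi-join two lists of (key, value) pairs, pre-filtering `large`.

    Returns the joined rows and the number of `large` rows that passed the filter.
    """
    bloom = BloomFilter(num_bits, num_hashes)
    index = {}
    for key, value in small:
        bloom.add(key)
        index.setdefault(key, []).append(value)

    # Filter step: drop rows of `large` whose key cannot be in `small`.
    survivors = [(k, v) for k, v in large if bloom.might_contain(k)]

    # Join step: the exact key lookup removes any Bloom false positives.
    joined = [(k, sv, lv) for k, lv in survivors for sv in index.get(k, [])]
    return joined, len(survivors)


small = [(1, "a"), (2, "b")]
large = [(2, "x"), (3, "y"), (2, "z")]
joined, survivor_count = bloom_filtered_join(small, large)
```

Because the filter can only produce false positives, the join result is always exact; the filter affects only how many rows of the larger input reach the join step.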
