A Theoretical and Experimental Comparison of Large-Scale Join Algorithms in Spark

Cited by: 0

Authors:

Phan A.-C. [1]

Phan T.-C. [2]

Trieu T.-N. [2]

Tran T.-T.-Q. [3]

Affiliations:

[1] Vinh Long University of Technology Education, Vinh Long City

[2] Can Tho University, Can Tho City

[3] IRISA, University of Rennes 1, Lannion City

Source:

SN Computer Science | 2021 / Vol. 2 / Issue 5

Keywords:

Big data analytics; Bloom filter; Join operation; MapReduce; Spark

DOI:

10.1007/s42979-021-00738-x

Abstract:

Currently, the estimated amount of data created daily has reached the threshold of petabytes or even zettabytes globally. It is no wonder that traditional data processing technologies cannot process and manage such extremely large volumes of data. However, these massive and varied data can be used to address business problems that could not be tackled before. To discover their value, query operations must be performed effectively in a parallel and distributed manner. One standard and common query operation is the join, which is expensive. This research systematically presents a theoretical and experimental comparison of the prominent join algorithms in the Spark environment. First, the study details important strategies for two-way joins and recursive joins. Then, it sets out the advantages and disadvantages of each approach. In particular, the work provides mathematical cost models to make the comparison of the joins more convincing before verifying it by experiments. The results show that the comparison based on the cost models is consistent with that based on the experiments. In general, the two-way and recursive joins that use filters are the best choices in the Spark environment. © 2021, The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd.
