A Theoretical and Experimental Comparison of Large-Scale Join Algorithms in Spark

Authors
Phan A.-C. [1]
Phan T.-C. [2]
Trieu T.-N. [2]
Tran T.-T.-Q. [3]
Affiliations
[1] Vinh Long University of Technology Education, Vinh Long City
[2] Can Tho University, Can Tho City
[3] IRISA, University of Rennes 1, Lannion City
Keywords
Big data analytics; Bloom filter; Join operation; MapReduce; Spark
DOI
10.1007/s42979-021-00738-x
Abstract
Currently, the estimated amount of data created daily has reached the threshold of petabytes, or even zettabytes, globally. It is no wonder that traditional data processing technologies cannot process and manage such extremely large volumes of data. However, these massive and varied data can be used to address business problems that could not have been tackled before. To discover their value, query operations must be performed effectively in a parallel and distributed manner. One standard and common query operation is the join, which is expensive. This research systematically presents a theoretical and experimental comparison of the prominent join algorithms in the Spark environment. First, the study details the important strategies for two-way joins and recursive joins. Then, it sets out the advantages and disadvantages of each approach. In particular, the work provides mathematical cost models to make the comparison of the joins more convincing before verifying it by experiments. The results show that the comparison based on the cost models is consistent with the comparison based on the experiments. In general, the two-way and recursive joins that use filters are the best choices in the Spark environment. © 2021, The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd.
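The abstract names filter-based joins as the best performers but does not describe how they work. As a rough illustration only, the following single-machine Python sketch shows the general idea of a Bloom-filter-assisted two-way equi-join: build a filter over the join keys of the smaller input, discard rows of the larger input that cannot match, and then join the survivors. It is not the authors' Spark implementation; the names `BloomFilter` and `bloom_filtered_join`, the filter size, and the hashing scheme are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Hypothetical fixed-size Bloom filter: false positives possible, false negatives not."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key!r}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def bloom_filtered_join(small, large, num_bits=1024, num_hashes=3):
    """Equi-join two lists of (key, value) pairs on key.

    Returns the joined rows and the number of rows of `large`
    that survived the filter.
    """
    # Build phase: record every join key of the smaller input.
    bloom = BloomFilter(num_bits, num_hashes)
    for key, _ in small:
        bloom.add(key)

    # Filter phase: drop rows whose key definitely has no partner.
    # In a distributed setting this step would run before the shuffle.
    survivors = [(k, v) for k, v in large if bloom.might_contain(k)]

    # Join phase: an ordinary hash join. Any false positives that
    # slipped through the filter are eliminated here.
    index = defaultdict(list)
    for k, v in small:
        index[k].append(v)
    joined = [(k, sv, lv) for k, lv in survivors for sv in index.get(k, [])]
    return joined, len(survivors)


if __name__ == "__main__":
    customers = [(1, "Ann"), (2, "Binh")]
    orders = [(1, "book"), (3, "pen"), (2, "lamp"), (1, "ink")]
    rows, kept = bloom_filtered_join(customers, orders)
    print(rows)
    print(f"{kept} of {len(orders)} rows passed the filter")
```

The benefit in a cluster setting would come from the filter phase: rows that cannot contribute to the result are removed before they are moved between nodes, at the cost of building and distributing the filter.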