A Theoretical and Experimental Comparison of Large-Scale Join Algorithms in Spark

Cited by: 0
Authors
Phan A.-C. [1]
Phan T.-C. [2]
Trieu T.-N. [2]
Tran T.-T.-Q. [3]
Affiliations
[1] Vinh Long University of Technology Education, Vinh Long City
[2] Can Tho University, Can Tho City
[3] IRISA, University of Rennes 1, Lannion City
Keywords
Big data analytics; Bloom filter; Join operation; MapReduce; Spark
DOI
10.1007/s42979-021-00738-x
Abstract
Currently, the estimated amount of data created daily has reached the threshold of petabytes or even zettabytes globally. It is no wonder that traditional data processing technologies cannot process and manage such extremely large volumes of data. However, these massive and varied data can be used to address business problems that could not have been tackled before. To discover their value, it is necessary to perform query operations effectively in a parallel and distributed manner. One of the most common query operations is the join, which is expensive. This research systematically presents a theoretical and experimental comparison of the prominent join algorithms in the Spark environment. First, this study details the important strategies for two-way joins and recursive joins. Then, it sets out the advantages and disadvantages of each approach. In particular, the work provides mathematical cost models to make a more convincing comparison of the joins before verifying it by experiments. The results show that the comparison using the cost models is consistent with that using the experiments. In general, the two-way and recursive joins that use filters are the best choices in the Spark environment. © 2021, The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd.
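The abstract names filter-based joins without showing what one looks like. As a rough illustration only, and not the authors' algorithm or cost model, the following single-machine sketch shows the general idea of a Bloom-filter join: build a filter over the join keys of one input, discard rows of the other input that cannot match, then run an exact hash join on the survivors. The names BloomFilter and filtered_join, the filter size, and the SHA-256 based hashing are all assumptions made for this example; in Spark the filter would typically be broadcast to the workers so that non-matching rows are dropped before the shuffle.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter (hypothetical helper, not a Spark API)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit position, for clarity rather than compactness.
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


def filtered_join(left, right, num_bits=1024, num_hashes=3):
    """Join two lists of (key, value) pairs on key.

    Returns (key, left_value, right_value) triples in left-input order.
    """
    # Step 1: summarise the join keys of the right input in a filter.
    bloom = BloomFilter(num_bits, num_hashes)
    for key, _ in right:
        bloom.add(key)

    # Step 2: drop left rows that certainly have no partner.
    survivors = [(k, v) for k, v in left if bloom.might_contain(k)]

    # Step 3: exact hash join, which also removes any false positives.
    index = {}
    for k, v in right:
        index.setdefault(k, []).append(v)
    return [(k, lv, rv) for k, lv in survivors for rv in index.get(k, [])]


if __name__ == "__main__":
    left = [(1, "a"), (2, "b"), (3, "c")]
    right = [(2, "x"), (3, "y"), (3, "z"), (4, "w")]
    print(filtered_join(left, right))
```

Because the exact join in step 3 re-checks every key, false positives from the filter cost extra work but never produce wrong rows; the filter size and hash count only affect how many non-matching rows slip through step 2.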
Related papers
50 in total
  • [21] A LARGE-SCALE EXPERIMENTAL COMPARISON OF THE TETRAD AND TRIANGLE TESTS IN CHILDREN
    Garcia, Karen
    Ennis, John M.
    Prinyawiwatkul, Witoon
    JOURNAL OF SENSORY STUDIES, 2012, 27 (04) : 217 - 222
  • [22] PARAFAC algorithms for large-scale problems
    Anh Huy Phan
    Cichocki, Andrzej
    NEUROCOMPUTING, 2011, 74 (11) : 1970 - 1984
  • [23] Algorithms for large-scale flat placement
    Vygen, J
    DESIGN AUTOMATION CONFERENCE - PROCEEDINGS 1997, 1997, : 746 - 751
  • [24] Algorithms for large-scale genotyping microarrays
    Liu, WM
    Di, XJ
    Yang, G
    Matsuzaki, H
    Huang, J
    Mei, R
    Ryder, TB
    Webster, TA
    Dong, SL
    Liu, GY
    Jones, KW
    Kennedy, GC
    Kulp, D
    BIOINFORMATICS, 2003, 19 (18) : 2397 - 2403
  • [25] Optimization Algorithms for Large-Scale Systems
    Azizan N.
    Performance Evaluation Review, 2020, 47 (03) : 2 - 5
  • [26] EXPERIMENTAL COMPARISON OF 2 HEURISTIC ALGORITHMS FOR ONE GENERALIZATION OF THE LARGE-SCALE PLANAR TRAVELING-SALESMAN PROBLEM
    GERSHUNI, DS
    SHERSTYUK, AV
    PROGRAMMING AND COMPUTER SOFTWARE, 1990, 16 (04) : 171 - 173
  • [27] Join Algorithms under Apache Spark: Revisited
    Al-Badarneh, Amer
    PROCEEDINGS OF THE 2019 5TH INTERNATIONAL CONFERENCE ON COMPUTER AND TECHNOLOGY APPLICATIONS (ICCTA 2019), 2019, : 56 - 62
  • [28] Scalable Algorithms for Bayesian Inference of Large-Scale Models from Large-Scale Data
    Ghattas, Omar
    Isaac, Tobin
    Petra, Noemi
    Stadler, Georg
    HIGH PERFORMANCE COMPUTING FOR COMPUTATIONAL SCIENCE - VECPAR 2016, 2017, 10150 : 3 - 6
  • [29] Appraising SPARK on Large-Scale Social Media Analysis
    Belcastro, Loris
    Marozzo, Fabrizio
    Talia, Domenico
    Trunfio, Paolo
    EURO-PAR 2017: PARALLEL PROCESSING WORKSHOPS, 2018, 10659 : 483 - 495
  • [30] Parallelism and Partitioning in Large-Scale GAs using Spark
    Alterkawi, Laila
    Migliavacca, Matteo
    PROCEEDINGS OF THE 2019 GENETIC AND EVOLUTIONARY COMPUTATION CONFERENCE (GECCO'19), 2019, : 736 - 744