Join Algorithms under Apache Spark: Revisited

Cited by: 0
Authors
Al-Badarneh, Amer [1]
Affiliations
[1] Jordan Univ Sci & Technol, Comp Informat Syst Dept, Irbid 22110, Jordan
Source
PROCEEDINGS OF THE 2019 5TH INTERNATIONAL CONFERENCE ON COMPUTER AND TECHNOLOGY APPLICATIONS (ICCTA 2019) | 2019
Keywords
MapReduce; Join Algorithm; Apache Spark; Hadoop
DOI
10.1145/3323933.3324094
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Large-scale applications now generate massive amounts of data and information. Processing data at this scale often requires massively parallel algorithms, for which the main performance metric is communication cost. Apache Spark is highly scalable and fault-tolerant, and it can run across many computers. The join is one of the most widely used operations in database systems, but it is also very time-consuming. In this work, we survey and critique several implementations of Spark join algorithms, discuss their strengths and weaknesses, present a detailed comparison of these algorithms, and introduce optimization approaches to enhance and tune join performance.
Pages: 56-62
Number of pages: 7