In-depth analysis on parallel processing patterns for high-performance Dataframes

Cited by: 0
Authors
Perera, Niranda [1]
Sarker, Arup Kumar [2, 3]
Staylor, Mills [2]
von Laszewski, Gregor [3]
Shan, Kaiying [2]
Kamburugamuve, Supun [1]
Widanage, Chathura [1]
Abeykoon, Vibhatha [1]
Kanewela, Thejaka Amila [1]
Fox, Geoffrey [2, 3]
Affiliations
[1] Indiana Univ Alumni, Bloomington, IN 47405 USA
[2] Univ Virginia, Charlottesville, VA 22904 USA
[3] Univ Virginia, Biocomplex Inst & Initiat, Charlottesville, VA 22904 USA
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2023 / Vol. 149
Keywords
Dataframes; High-performance computing; Data engineering; Relational algebra; MPI; Distributed Memory Parallel; MODEL; OPTIMIZATION; ALGORITHMS; LOGP;
DOI
10.1016/j.future.2023.07.007
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more complexities to data engineering applications, which are now integrated into data processing pipelines to process terabytes of data. Typically, a significant amount of time is spent on data preprocessing in these pipelines, and hence improving its efficiency directly impacts the overall pipeline performance. The community has recently embraced the concept of Dataframes as the de-facto data structure for data representation and manipulation. However, the most widely used serial Dataframes today (R, pandas) experience performance limitations while working on even moderately large data sets. We believe that there is plenty of room for improvement by taking a look at this problem from a high-performance computing point of view. In a prior publication, we presented a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon [1]. In this paper, we are expanding on the initial concept by introducing a cost model for evaluating the said patterns. Furthermore, we evaluate the performance of Cylon on the ORNL Summit supercomputer.
Pages: 250-264
Number of pages: 15