In-depth analysis on parallel processing patterns for high-performance Dataframes

Cited by: 0
Authors
Perera, Niranda [1]
Sarker, Arup Kumar [2, 3]
Staylor, Mills [2]
von Laszewski, Gregor [3]
Shan, Kaiying [2]
Kamburugamuve, Supun [1]
Widanage, Chathura [1]
Abeykoon, Vibhatha [1]
Kanewela, Thejaka Amila [1]
Fox, Geoffrey [2, 3]
Affiliations
[1] Indiana Univ Alumni, Bloomington, IN 47405 USA
[2] Univ Virginia, Charlottesville, VA 22904 USA
[3] Univ Virginia, Biocomplex Inst & Initiat, Charlottesville, VA 22904 USA
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2023 / Vol. 149
Keywords
Dataframes; High-performance computing; Data engineering; Relational algebra; MPI; Distributed Memory Parallel; MODEL; OPTIMIZATION; ALGORITHMS; LOGP
DOI
10.1016/j.future.2023.07.007
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more complexities to data engineering applications, which are now integrated into data processing pipelines to process terabytes of data. Typically, a significant amount of time is spent on data preprocessing in these pipelines, and hence improving its efficiency directly impacts the overall pipeline performance. The community has recently embraced the concept of Dataframes as the de facto data structure for data representation and manipulation. However, the most widely used serial Dataframes today (R, pandas) experience performance limitations while working on even moderately large data sets. We believe that there is plenty of room for improvement by taking a look at this problem from a high-performance computing point of view. In a prior publication, we presented a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon [1]. In this paper, we are expanding on the initial concept by introducing a cost model for evaluating the said patterns. Furthermore, we evaluate the performance of Cylon on the ORNL Summit supercomputer.
Pages: 250-264
Page count: 15