In-depth analysis on parallel processing patterns for high-performance Dataframes

Cited by: 0
Authors
Perera, Niranda [1]
Sarker, Arup Kumar [2, 3]
Staylor, Mills [2]
von Laszewski, Gregor [3]
Shan, Kaiying [2]
Kamburugamuve, Supun [1]
Widanage, Chathura [1]
Abeykoon, Vibhatha [1]
Kanewela, Thejaka Amila [1]
Fox, Geoffrey [2, 3]
Affiliations
[1] Indiana Univ Alumni, Bloomington, IN 47405 USA
[2] Univ Virginia, Charlottesville, VA 22904 USA
[3] Univ Virginia, Biocomplex Inst & Initiat, Charlottesville, VA 22904 USA
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2023, Vol. 149
Keywords
Dataframes; High-performance computing; Data engineering; Relational algebra; MPI; Distributed Memory Parallel; MODEL; OPTIMIZATION; ALGORITHMS; LOGP
DOI
10.1016/j.future.2023.07.007
Chinese Library Classification (CLC) number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more complexities to data engineering applications, which are now integrated into data processing pipelines to process terabytes of data. Typically, a significant amount of time is spent on data preprocessing in these pipelines, and hence improving its efficiency directly impacts the overall pipeline performance. The community has recently embraced the concept of Dataframes as the de-facto data structure for data representation and manipulation. However, the most widely used serial Dataframes today (R, pandas) experience performance limitations while working on even moderately large data sets. We believe that there is plenty of room for improvement by taking a look at this problem from a high-performance computing point of view. In a prior publication, we presented a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon [1]. In this paper, we are expanding on the initial concept by introducing a cost model for evaluating the said patterns. Furthermore, we evaluate the performance of Cylon on the ORNL Summit supercomputer.
Pages: 250-264
Page count: 15