In-depth analysis on parallel processing patterns for high-performance Dataframes

Cited by: 0
Authors
Perera, Niranda [1]
Sarker, Arup Kumar [2, 3]
Staylor, Mills [2]
von Laszewski, Gregor [3]
Shan, Kaiying [2]
Kamburugamuve, Supun [1]
Widanage, Chathura [1]
Abeykoon, Vibhatha [1]
Kanewela, Thejaka Amila [1]
Fox, Geoffrey [2, 3]
Affiliations
[1] Indiana Univ Alumni, Bloomington, IN 47405 USA
[2] Univ Virginia, Charlottesville, VA 22904 USA
[3] Univ Virginia, Biocomplex Inst & Initiat, Charlottesville, VA 22904 USA
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2023 / Vol. 149
Keywords
Dataframes; High-performance computing; Data engineering; Relational algebra; MPI; Distributed Memory Parallel; MODEL; OPTIMIZATION; ALGORITHMS; LOGP;
DOI
10.1016/j.future.2023.07.007
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more complexities to data engineering applications, which are now integrated into data processing pipelines to process terabytes of data. Typically, a significant amount of time is spent on data preprocessing in these pipelines, and hence improving its efficiency directly impacts the overall pipeline performance. The community has recently embraced the concept of Dataframes as the de-facto data structure for data representation and manipulation. However, the most widely used serial Dataframes today (R, pandas) experience performance limitations while working on even moderately large data sets. We believe that there is plenty of room for improvement by taking a look at this problem from a high-performance computing point of view. In a prior publication, we presented a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon [1]. In this paper, we are expanding on the initial concept by introducing a cost model for evaluating the said patterns. Furthermore, we evaluate the performance of Cylon on the ORNL Summit supercomputer.
Pages: 250 - 264
Page count: 15
Related papers
50 in total
  • [11] High-Performance Computing with Quantum Processing Units
    Britt, Keith A.
    Humble, Travis S.
    ACM JOURNAL ON EMERGING TECHNOLOGIES IN COMPUTING SYSTEMS, 2017, 13 (03)
  • [12] A NEW FRAMEWORK OF CLUSTER-BASED PARALLEL PROCESSING SYSTEM FOR HIGH-PERFORMANCE GEO-COMPUTING
    Ma, Yan
    Liu, Dingsheng
    Li, Jingshan
    2009 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, VOLS 1-5, 2009, : 2429 - 2432
  • [13] GenStore: A High-Performance In-Storage Processing System for Genome Sequence Analysis
    Ghiasi, Nika Mansouri
    Park, Jisung
    Mustafa, Harun
    Kim, Jeremie
    Olgun, Ataberk
    Gollwitzer, Arvid
    Cali, Damla Senol
    Firtina, Can
    Mao, Haiyu
    Alserr, Nour Almadhoun
    Ausavarungnirun, Rachata
    Vijaykumar, Nandita
    Alser, Mohammed
    Mutlu, Onur
    ASPLOS '22: PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS, 2022, : 635 - 654
  • [14] Revisiting the parallel tempering algorithm: High-performance computing and applications in operations research
    Almeida, Andre Luis Barroso
    Lima, Joubert de Castro
    Carvalho, Marco Antonio Moreira
    COMPUTERS & OPERATIONS RESEARCH, 2025, 178
  • [15] A universal parallel simulation framework for energy pipeline networks on high-performance computers
    Han, Pu
    Hua, Haobo
    Wang, Hai
    Xue, Fei
    Wu, Changmao
    Shang, Jiandong
    JOURNAL OF SUPERCOMPUTING, 2024, 80 (10) : 14085 - 14115
  • [16] Optimizing Virtual Power Plants with Parallel Simulated Annealing on High-Performance Computing
    Abbasi, Ali
    Alves, Filipe
    Ribeiro, Rui A.
    Sobral, Joao L.
    Rodrigues, Ricardo
    SMART CITIES, 2025, 8 (02)
  • [17] Synthesis of high-performance packet processing pipelines
    Soviani, Cristian
    Hadzic, Ilija
    Edwards, Stephen A.
    43RD DESIGN AUTOMATION CONFERENCE, PROCEEDINGS 2006, 2006, : 679 - +
  • [18] High-Performance Design Patterns and File Formats for Side-Channel Analysis
    Bosland J.
    Ene S.
    Baumgartner P.
    Immler V.
    IACR Transactions on Cryptographic Hardware and Embedded Systems, 2024, 2024 (02) : 769 - 794
  • [19] Performance-based parallel application toolkit for high-performance clusters
    Li, Kuan-Ching
    Weng, Tien-Hsiung
    JOURNAL OF SUPERCOMPUTING, 2009, 48 (01) : 43 - 65
  • [20] Performance-based parallel application toolkit for high-performance clusters
    Kuan-Ching Li
    Tien-Hsiung Weng
    The Journal of Supercomputing, 2009, 48 : 43 - 65