In-depth analysis on parallel processing patterns for high-performance Dataframes

Cited by: 0

Authors:

Perera, Niranda ^{[1]}

Sarker, Arup Kumar ^{[2,3]}

Staylor, Mills ^{[2]}

von Laszewski, Gregor ^{[3]}

Shan, Kaiying ^{[2]}

Kamburugamuve, Supun ^{[1]}

Widanage, Chathura ^{[1]}

Abeykoon, Vibhatha ^{[1]}

Kanewela, Thejaka Amila ^{[1]}

Fox, Geoffrey ^{[2,3]}

Affiliations:

[1] Indiana Univ Alumni, Bloomington, IN 47405 USA

[2] Univ Virginia, Charlottesville, VA 22904 USA

[3] Univ Virginia, Biocomplex Inst & Initiat, Charlottesville, VA 22904 USA

Source:

FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2023 / Vol. 149

Keywords:

Dataframes; High-performance computing; Data engineering; Relational algebra; MPI; Distributed Memory Parallel; MODEL; OPTIMIZATION; ALGORITHMS; LOGP;

DOI:

10.1016/j.future.2023.07.007

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more complexities to data engineering applications, which are now integrated into data processing pipelines to process terabytes of data. Typically, a significant amount of time is spent on data preprocessing in these pipelines, and hence improving its efficiency directly impacts the overall pipeline performance. The community has recently embraced the concept of Dataframes as the de facto data structure for data representation and manipulation. However, the most widely used serial Dataframes today (R, pandas) experience performance limitations while working on even moderately large data sets. We believe that there is plenty of room for improvement by taking a look at this problem from a high-performance computing point of view. In a prior publication, we presented a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon [1]. In this paper, we expand on the initial concept by introducing a cost model for evaluating these patterns. Furthermore, we evaluate the performance of Cylon on the ORNL Summit supercomputer.

Citation

Pages: 250-264

Page count: 15
