Automated parallel execution of distributed task graphs with FPGA clusters

Cited: 0
Authors
Ruiz, Juan Miguel de Haro [1 ,2 ]
Martinez, Carlos Alvarez [1 ,2 ]
Jimenez-Gonzalez, Daniel [1 ,2 ]
Martorell, Xavier [1 ,2 ]
Ueno, Tomohiro [3 ]
Sano, Kentaro [3 ]
Ringlein, Burkhard [4 ]
Abel, Francois [4 ]
Weiss, Beat [4 ]
Affiliations
[1] Barcelona Supercomp Ctr, Barcelona, Spain
[2] Univ Politecn Cataluna, Barcelona, Spain
[3] Riken Ctr Computat Sci, Kobe, Hyogo, Japan
[4] IBM Res Europe, Zurich, Switzerland
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2024, Vol. 160
Funding
Japan Society for the Promotion of Science (JSPS); European Union Horizon 2020;
Keywords
FPGA; MPI; Task graphs; Heterogeneous computing; High performance computing; Programming models; Distributed computing;
DOI
10.1016/j.future.2024.06.041
CLC Number
TP301 [Theory and Methods];
Subject Classification Code
081202;
Abstract
Over the years, Field Programmable Gate Arrays (FPGAs) have been gaining popularity in the High Performance Computing (HPC) field, because their reconfigurability enables very fine-grained optimizations at a low energy cost. However, the differing characteristics, architectures, and network topologies of FPGA clusters have hindered the use of FPGAs at a large scale. In this work, we present an evolution of OmpSs@FPGA, a high-level task-based programming model and extension to OmpSs-2, that aims at unifying all FPGA clusters by using a message-passing interface that is compatible with FPGA accelerators. These accelerators are programmed with C/C++ pragmas and synthesized with High-Level Synthesis tools. The new framework includes a custom protocol to exchange messages between FPGAs, agnostic of the architecture and network type. On top of that, we present a new communication paradigm called Implicit Message Passing (IMP), where the user does not need to call any message-passing API; instead, the runtime automatically infers data movement between nodes. We test classic message passing and IMP with three benchmarks on two different FPGA clusters. One is cloudFPGA, a disaggregated platform with AMD FPGAs that are connected to the network only through UDP/TCP/IP. The other is ESSPER, composed of CPU-attached Intel FPGAs that have a private Ethernet-level network. In both cases, we demonstrate that IMP with OmpSs@FPGA can increase the productivity of FPGA programmers at a large scale by simplifying communication between nodes, without limiting the scalability of applications. We implement the N-body, Heat simulation, and Cholesky decomposition benchmarks, and show that the FPGA clusters achieve 2.6x and 2.4x better performance per watt than a CPU-only supercomputer for N-body and Heat, respectively.
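To make the programming-model idea in the abstract concrete, the following minimal sketch shows what a task-based, pragma-annotated FPGA kernel might look like in this style. The clause spelling, array-section syntax, and the function scale() are assumptions for illustration only, not the actual OmpSs@FPGA API; the point is that the in/out directionality annotations on the task give the runtime the dependence information from which a scheme like Implicit Message Passing can derive node-to-node data movement, so the programmer never calls a message-passing routine explicitly.

    #include <stdio.h>

    #define N 1024

    /* Hypothetical task annotated for FPGA offload. The in/out clauses
     * declare which data the task reads and writes; under an IMP-style
     * runtime this is what allows transfers between nodes to be inferred
     * instead of being coded as explicit sends and receives. */
    #pragma omp target device(fpga)
    #pragma omp task in(a[0;n]) out(b[0;n])
    void scale(const float *a, float *b, int n)
    {
        for (int i = 0; i < n; i++)
            b[i] = 2.0f * a[i];
    }

    int main(void)
    {
        static float a[N], b[N];
        for (int i = 0; i < N; i++)
            a[i] = (float)i;

        /* With a task-based toolchain this call would become an asynchronous
         * task; with a plain C compiler the pragmas are ignored and the
         * function simply runs inline, so the sketch stays executable. */
        scale(a, b, N);

        #pragma omp taskwait
        printf("b[10] = %f\n", b[10]);
        return 0;
    }

A plain C compiler ignores the unknown pragmas and runs the code sequentially, which is the usual appeal of pragma-based models: the same source can be built unmodified for a single CPU or, with the appropriate toolchain, offloaded to FPGA accelerators and distributed across nodes.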
Pages: 808-824
Number of pages: 17