Automated parallel execution of distributed task graphs with FPGA clusters

Cited by: 0

Authors:

Ruiz, Juan Miguel de Haro [1,2]

Martinez, Carlos Alvarez [1,2]

Jimenez-Gonzalez, Daniel [1,2]

Martorell, Xavier [1,2]

Ueno, Tomohiro [3]

Sano, Kentaro [3]

Ringlein, Burkhard [4]

Abel, Francois [4]

Weiss, Beat [4]

Affiliations:

[1] Barcelona Supercomp Ctr, Barcelona, Spain

[2] Univ Politecn Cataluna, Barcelona, Spain

[3] Riken Ctr Computat Sci, Kobe, Hyogo, Japan

[4] IBM Res Europe, Zurich, Switzerland

Source:

FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2024 / Vol. 160

Funding:

Japan Society for the Promotion of Science; European Union Horizon 2020;

Keywords:

FPGA; MPI; Task graphs; Heterogeneous computing; High performance computing; Programming models; Distributed computing;

DOI:

10.1016/j.future.2024.06.041

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

Over the years, Field Programmable Gate Arrays (FPGAs) have been gaining popularity in the High Performance Computing (HPC) field, because their reconfigurability enables very fine-grained optimizations with low energy cost. However, the different characteristics, architectures, and network topologies of the clusters have hindered the use of FPGAs at a large scale. In this work, we present an evolution of OmpSs@FPGA, a high-level task-based programming model and extension to OmpSs-2, that aims at unifying all FPGA clusters by using a message-passing interface that is compatible with FPGA accelerators. These accelerators are programmed with C/C++ pragmas and synthesized with High-Level Synthesis tools. The new framework includes a custom protocol to exchange messages between FPGAs that is agnostic of the architecture and network type. On top of that, we present a new communication paradigm called Implicit Message Passing (IMP), where the user does not need to call any message-passing API. Instead, the runtime automatically infers data movement between nodes. We test classic message passing and IMP with three benchmarks on two different FPGA clusters. One is cloudFPGA, a disaggregated platform with AMD FPGAs that are only connected to the network through UDP/TCP/IP. The other is ESSPER, composed of CPU-attached Intel FPGAs that have a private network at the Ethernet level. In both cases, we demonstrate that IMP with OmpSs@FPGA can increase the productivity of FPGA programmers at a large scale by simplifying communication between nodes, without limiting the scalability of applications. We implement the N-body, Heat simulation, and Cholesky decomposition benchmarks, and show that FPGA clusters get 2.6x and 2.4x better performance per watt than a CPU-only supercomputer for N-body and Heat, respectively.


Pages: 808-824

Number of pages: 17
