Automated parallel execution of distributed task graphs with FPGA clusters

Cited: 0
Authors
Ruiz, Juan Miguel de Haro [1 ,2 ]
Martinez, Carlos Alvarez [1 ,2 ]
Jimenez-Gonzalez, Daniel [1 ,2 ]
Martorell, Xavier [1 ,2 ]
Ueno, Tomohiro [3 ]
Sano, Kentaro [3 ]
Ringlein, Burkhard [4 ]
Abel, Francois [4 ]
Weiss, Beat [4 ]
Affiliations
[1] Barcelona Supercomp Ctr, Barcelona, Spain
[2] Univ Politecn Cataluna, Barcelona, Spain
[3] Riken Ctr Computat Sci, Kobe, Hyogo, Japan
[4] IBM Res Europe, Zurich, Switzerland
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2024 / Vol. 160
Funding
Japan Society for the Promotion of Science (JSPS); EU Horizon 2020;
Keywords
FPGA; MPI; Task graphs; Heterogeneous computing; High performance computing; Programming models; Distributed computing;
DOI
10.1016/j.future.2024.06.041
Chinese Library Classification
TP301 [Theory and Methods];
Discipline Code
081202;
Abstract
Over the years, Field Programmable Gate Arrays (FPGAs) have been gaining popularity in the High Performance Computing (HPC) field, because their reconfigurability enables very fine-grained optimizations at low energy cost. However, the differing characteristics, architectures, and network topologies of clusters have hindered the use of FPGAs at large scale. In this work, we present an evolution of OmpSs@FPGA, a high-level task-based programming model and extension to OmpSs-2, that aims at unifying all FPGA clusters by using a message-passing interface that is compatible with FPGA accelerators. These accelerators are programmed with C/C++ pragmas and synthesized with High-Level Synthesis tools. The new framework includes a custom protocol to exchange messages between FPGAs, agnostic of the architecture and network type. On top of that, we present a new communication paradigm called Implicit Message Passing (IMP), where the user does not need to call any message-passing API. Instead, the runtime automatically infers data movement between nodes. We test classic message passing and IMP with three benchmarks on two different FPGA clusters. One is cloudFPGA, a disaggregated platform with AMD FPGAs that are connected to the network only through UDP/TCP/IP. The other is ESSPER, composed of CPU-attached Intel FPGAs that have a private network at the Ethernet level. In both cases, we demonstrate that IMP with OmpSs@FPGA can increase the productivity of FPGA programmers at large scale by simplifying communication between nodes, without limiting the scalability of applications. We implement the N-body, Heat simulation, and Cholesky decomposition benchmarks, and show that FPGA clusters achieve 2.6x and 2.4x better performance per watt than a CPU-only supercomputer for N-body and Heat, respectively.
Pages: 808-824 (17 pages)