Automated parallel execution of distributed task graphs with FPGA clusters

Cited: 0
Authors
de Haro Ruiz, Juan Miguel [1 ,2 ]
Alvarez Martinez, Carlos [1 ,2 ]
Jimenez-Gonzalez, Daniel [1 ,2 ]
Martorell, Xavier [1 ,2 ]
Ueno, Tomohiro [3 ]
Sano, Kentaro [3 ]
Ringlein, Burkhard [4 ]
Abel, Francois [4 ]
Weiss, Beat [4 ]
Affiliations
[1] Barcelona Supercomp Ctr, Barcelona, Spain
[2] Univ Politecn Cataluna, Barcelona, Spain
[3] Riken Ctr Computat Sci, Kobe, Hyogo, Japan
[4] IBM Res Europe, Zurich, Switzerland
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2024, Vol. 160
Funding
Japan Society for the Promotion of Science; EU Horizon 2020;
Keywords
FPGA; MPI; Task graphs; Heterogeneous computing; High performance computing; Programming models; Distributed computing;
DOI
10.1016/j.future.2024.06.041
Chinese Library Classification
TP301 [Theory, Methods];
Discipline code
081202;
Abstract
Over the years, Field Programmable Gate Arrays (FPGAs) have been gaining popularity in the High Performance Computing (HPC) field, because their reconfigurability enables very fine-grained optimizations at low energy cost. However, the differing characteristics, architectures, and network topologies of clusters have hindered the use of FPGAs at large scale. In this work, we present an evolution of OmpSs@FPGA, a high-level task-based programming model and extension to OmpSs-2, that aims at unifying all FPGA clusters by using a message-passing interface that is compatible with FPGA accelerators. These accelerators are programmed with C/C++ pragmas and synthesized with High-Level Synthesis tools. The new framework includes a custom protocol to exchange messages between FPGAs, agnostic of the architecture and network type. On top of that, we present a new communication paradigm called Implicit Message Passing (IMP), where the user does not need to call any message-passing API. Instead, the runtime automatically infers data movement between nodes. We test classic message passing and IMP with three benchmarks on two different FPGA clusters. One is cloudFPGA, a disaggregated platform with AMD FPGAs that are connected to the network only through UDP/TCP/IP. The other is ESSPER, composed of CPU-attached Intel FPGAs that have a private network at the Ethernet level. In both cases, we demonstrate that IMP with OmpSs@FPGA can increase the productivity of FPGA programmers at large scale by simplifying communication between nodes, without limiting the scalability of applications. We implement the N-body, Heat simulation, and Cholesky decomposition benchmarks, and show that FPGA clusters achieve 2.6x and 2.4x better performance per watt than a CPU-only supercomputer for N-body and Heat, respectively.
Pages: 808-824
Page count: 17