Automated parallel execution of distributed task graphs with FPGA clusters

Cited: 0
Authors
de Haro Ruiz, Juan Miguel [1 ,2 ]
Alvarez Martinez, Carlos [1 ,2 ]
Jimenez-Gonzalez, Daniel [1 ,2 ]
Martorell, Xavier [1 ,2 ]
Ueno, Tomohiro [3 ]
Sano, Kentaro [3 ]
Ringlein, Burkhard [4 ]
Abel, Francois [4 ]
Weiss, Beat [4 ]
Affiliations
[1] Barcelona Supercomp Ctr, Barcelona, Spain
[2] Univ Politecn Cataluna, Barcelona, Spain
[3] Riken Ctr Computat Sci, Kobe, Hyogo, Japan
[4] IBM Res Europe, Zurich, Switzerland
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2024, Vol. 160
Funding
Japan Society for the Promotion of Science; EU Horizon 2020;
Keywords
FPGA; MPI; Task graphs; Heterogeneous computing; High performance computing; Programming models; Distributed computing;
DOI
10.1016/j.future.2024.06.041
Chinese Library Classification
TP301 [Theory, Methods];
Discipline code
081202;
Abstract
Over the years, Field Programmable Gate Arrays (FPGAs) have been gaining popularity in the High Performance Computing (HPC) field, because their reconfigurability enables very fine-grained optimizations at low energy cost. However, the differing characteristics, architectures, and network topologies of clusters have hindered the use of FPGAs at large scale. In this work, we present an evolution of OmpSs@FPGA, a high-level task-based programming model and extension to OmpSs-2, that aims at unifying all FPGA clusters by using a message-passing interface that is compatible with FPGA accelerators. These accelerators are programmed with C/C++ pragmas and synthesized with High-Level Synthesis tools. The new framework includes a custom protocol to exchange messages between FPGAs, agnostic of the architecture and network type. On top of that, we present a new communication paradigm called Implicit Message Passing (IMP), where the user does not need to call any message-passing API. Instead, the runtime automatically infers data movement between nodes. We test classic message passing and IMP with three benchmarks on two different FPGA clusters. One is cloudFPGA, a disaggregated platform with AMD FPGAs that are connected to the network only through UDP/TCP/IP. The other is ESSPER, composed of CPU-attached Intel FPGAs that have a private network at the Ethernet level. In both cases, we demonstrate that IMP with OmpSs@FPGA can increase the productivity of FPGA programmers at large scale by simplifying communication between nodes, without limiting the scalability of applications. We implement the N-body, Heat simulation, and Cholesky decomposition benchmarks, and show that FPGA clusters achieve 2.6x and 2.4x better performance per watt than a CPU-only supercomputer for N-body and Heat, respectively.
Pages: 808-824
Page count: 17