Automated parallel execution of distributed task graphs with FPGA clusters

Cited: 0
Authors
Ruiz, Juan Miguel de Haro [1 ,2 ]
Martinez, Carlos Alvarez [1 ,2 ]
Jimenez-Gonzalez, Daniel [1 ,2 ]
Martorell, Xavier [1 ,2 ]
Ueno, Tomohiro [3 ]
Sano, Kentaro [3 ]
Ringlein, Burkhard [4 ]
Abel, Francois [4 ]
Weiss, Beat [4 ]
Affiliations
[1] Barcelona Supercomp Ctr, Barcelona, Spain
[2] Univ Politecn Cataluna, Barcelona, Spain
[3] Riken Ctr Computat Sci, Kobe, Hyogo, Japan
[4] IBM Res Europe, Zurich, Switzerland
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2024, Vol. 160
Funding
Japan Society for the Promotion of Science (JSPS); European Union Horizon 2020;
Keywords
FPGA; MPI; Task graphs; Heterogeneous computing; High performance computing; Programming models; Distributed computing;
DOI
10.1016/j.future.2024.06.041
CLC Number
TP301 [Theory and Methods];
Subject Classification Code
081202;
Abstract
Over the years, Field Programmable Gate Arrays (FPGAs) have been gaining popularity in the High Performance Computing (HPC) field, because their reconfigurability enables very fine-grained optimizations at a low energy cost. However, the differing characteristics, architectures, and network topologies of FPGA clusters have hindered the use of FPGAs at a large scale. In this work, we present an evolution of OmpSs@FPGA, a high-level task-based programming model and extension to OmpSs-2, that aims at unifying all FPGA clusters by using a message-passing interface that is compatible with FPGA accelerators. These accelerators are programmed with C/C++ pragmas and synthesized with High-Level Synthesis tools. The new framework includes a custom protocol to exchange messages between FPGAs, agnostic of the architecture and network type. On top of that, we present a new communication paradigm called Implicit Message Passing (IMP), where the user does not need to call any message-passing API; instead, the runtime automatically infers data movement between nodes. We test classic message passing and IMP with three benchmarks on two different FPGA clusters. One is cloudFPGA, a disaggregated platform with AMD FPGAs that are connected to the network only through UDP/TCP/IP. The other is ESSPER, composed of CPU-attached Intel FPGAs that have a private Ethernet-level network. In both cases, we demonstrate that IMP with OmpSs@FPGA can increase the productivity of FPGA programmers at a large scale by simplifying communication between nodes, without limiting the scalability of applications. We implement the N-body, Heat simulation, and Cholesky decomposition benchmarks, and show that the FPGA clusters achieve 2.6x and 2.4x better performance per watt than a CPU-only supercomputer for N-body and Heat, respectively.
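To make the programming-model idea in the abstract concrete, the following minimal sketch shows what a task-based, pragma-annotated FPGA kernel might look like in this style. The clause spelling, array-section syntax, and the function scale() are assumptions for illustration only, not the actual OmpSs@FPGA API; the point is that the in/out directionality annotations on the task give the runtime the dependence information from which a scheme like Implicit Message Passing can derive node-to-node data movement, so the programmer never calls a message-passing routine explicitly.

    #include <stdio.h>

    #define N 1024

    /* Hypothetical task annotated for FPGA offload. The in/out clauses
     * declare which data the task reads and writes; under an IMP-style
     * runtime this is what allows transfers between nodes to be inferred
     * instead of being coded as explicit sends and receives. */
    #pragma omp target device(fpga)
    #pragma omp task in(a[0;n]) out(b[0;n])
    void scale(const float *a, float *b, int n)
    {
        for (int i = 0; i < n; i++)
            b[i] = 2.0f * a[i];
    }

    int main(void)
    {
        static float a[N], b[N];
        for (int i = 0; i < N; i++)
            a[i] = (float)i;

        /* With a task-based toolchain this call would become an asynchronous
         * task; with a plain C compiler the pragmas are ignored and the
         * function simply runs inline, so the sketch stays executable. */
        scale(a, b, N);

        #pragma omp taskwait
        printf("b[10] = %f\n", b[10]);
        return 0;
    }

A plain C compiler ignores the unknown pragmas and runs the code sequentially, which is the usual appeal of pragma-based models: the same source can be built unmodified for a single CPU or, with the appropriate toolchain, offloaded to FPGA accelerators and distributed across nodes.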
Pages: 808-824
Number of pages: 17