Automated parallel execution of distributed task graphs with FPGA clusters

Cited by: 0
Authors
Ruiz, Juan Miguel de Haro [1, 2]
Martinez, Carlos Alvarez [1, 2]
Jimenez-Gonzalez, Daniel [1, 2]
Martorell, Xavier [1, 2]
Ueno, Tomohiro [3]
Sano, Kentaro [3]
Ringlein, Burkhard [4]
Abel, Francois [4]
Weiss, Beat [4]
Affiliations
[1] Barcelona Supercomp Ctr, Barcelona, Spain
[2] Univ Politecn Cataluna, Barcelona, Spain
[3] Riken Ctr Computat Sci, Kobe, Hyogo, Japan
[4] IBM Res Europe, Zurich, Switzerland
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2024 / Vol. 160
Funding
Japan Society for the Promotion of Science; EU Horizon 2020
Keywords
FPGA; MPI; Task graphs; Heterogeneous computing; High performance computing; Programming models; Distributed computing;
DOI
10.1016/j.future.2024.06.041
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Over the years, Field Programmable Gate Arrays (FPGAs) have been gaining popularity in the High Performance Computing (HPC) field, because their reconfigurability enables very fine-grained optimizations with low energy cost. However, the different characteristics, architectures, and network topologies of the clusters have hindered the use of FPGAs at a large scale. In this work, we present an evolution of OmpSs@FPGA, a high-level task-based programming model and extension to OmpSs-2, that aims at unifying all FPGA clusters by using a message-passing interface that is compatible with FPGA accelerators. These accelerators are programmed with C/C++ pragmas and synthesized with High-Level Synthesis tools. The new framework includes a custom protocol to exchange messages between FPGAs, agnostic of the architecture and network type. On top of that, we present a new communication paradigm called Implicit Message Passing (IMP), where the user does not need to call any message-passing API. Instead, the runtime automatically infers data movement between nodes. We test classic message passing and IMP with three benchmarks on two different FPGA clusters. One is cloudFPGA, a disaggregated platform with AMD FPGAs that are only connected to the network through UDP/TCP/IP. The other is ESSPER, composed of CPU-attached Intel FPGAs that have a private network at the Ethernet level. In both cases, we demonstrate that IMP with OmpSs@FPGA can increase the productivity of FPGA programmers at a large scale by simplifying communication between nodes, without limiting the scalability of applications. We implement the N-body, Heat simulation, and Cholesky decomposition benchmarks, and show that FPGA clusters get 2.6x and 2.4x better performance per watt than a CPU-only supercomputer for N-body and Heat, respectively.
Pages: 808-824
Page count: 17