Automated parallel execution of distributed task graphs with FPGA clusters

Cited: 0
Authors
Ruiz, Juan Miguel de Haro [1 ,2 ]
Martinez, Carlos Alvarez [1 ,2 ]
Jimenez-Gonzalez, Daniel [1 ,2 ]
Martorell, Xavier [1 ,2 ]
Ueno, Tomohiro [3 ]
Sano, Kentaro [3 ]
Ringlein, Burkhard [4 ]
Abel, Francois [4 ]
Weiss, Beat [4 ]
Affiliations
[1] Barcelona Supercomp Ctr, Barcelona, Spain
[2] Univ Politecn Cataluna, Barcelona, Spain
[3] Riken Ctr Computat Sci, Kobe, Hyogo, Japan
[4] IBM Res Europe, Zurich, Switzerland
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2024 / Vol. 160
Funding
Japan Society for the Promotion of Science (JSPS); EU Horizon 2020;
Keywords
FPGA; MPI; Task graphs; Heterogeneous computing; High performance computing; Programming models; Distributed computing;
DOI
10.1016/j.future.2024.06.041
Chinese Library Classification
TP301 [Theory and Methods];
Discipline Code
081202;
Abstract
Over the years, Field Programmable Gate Arrays (FPGAs) have been gaining popularity in the High Performance Computing (HPC) field, because their reconfigurability enables very fine-grained optimizations at low energy cost. However, the differing characteristics, architectures, and network topologies of clusters have hindered the use of FPGAs at large scale. In this work, we present an evolution of OmpSs@FPGA, a high-level task-based programming model and extension to OmpSs-2, that aims at unifying all FPGA clusters by using a message-passing interface that is compatible with FPGA accelerators. These accelerators are programmed with C/C++ pragmas and synthesized with High-Level Synthesis tools. The new framework includes a custom protocol to exchange messages between FPGAs, agnostic of the architecture and network type. On top of that, we present a new communication paradigm called Implicit Message Passing (IMP), where the user does not need to call any message-passing API. Instead, the runtime automatically infers data movement between nodes. We test classic message passing and IMP with three benchmarks on two different FPGA clusters. One is cloudFPGA, a disaggregated platform with AMD FPGAs that are connected to the network only through UDP/TCP/IP. The other is ESSPER, composed of CPU-attached Intel FPGAs that have a private network at the Ethernet level. In both cases, we demonstrate that IMP with OmpSs@FPGA can increase the productivity of FPGA programmers at large scale by simplifying communication between nodes, without limiting the scalability of applications. We implement the N-body, Heat simulation, and Cholesky decomposition benchmarks, and show that FPGA clusters achieve 2.6x and 2.4x better performance per watt than a CPU-only supercomputer for N-body and Heat, respectively.
Pages: 808-824 (17 pages)