Parallel Processing on FPGA Combining Computation and Communication in OpenCL Programming

Cited by: 10
Authors
Fujita, Norihisa [1]
Kobayashi, Ryohei [1, 2]
Yamaguchi, Yoshiki [1, 2]
Boku, Taisuke [1, 2]
Affiliations
[1] Univ Tsukuba, Ctr Computat Sci, Tsukuba, Ibaraki, Japan
[2] Univ Tsukuba, Grad Sch Syst & Informat Engn, Tsukuba, Ibaraki, Japan
Source
2019 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW) | 2019
Keywords
FPGA; OpenCL; HLS; parallel computing; inter-connection
DOI
10.1109/IPDPSW.2019.00089
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
In recent years, Field Programmable Gate Arrays (FPGAs) have been a topic of interest in High Performance Computing (HPC) research. Although the biggest problem in utilizing FPGAs for HPC applications is the difficulty of developing for them, this problem is being solved by High Level Synthesis (HLS). We focus on very high-performance inter-FPGA communication capabilities. The absolute floating-point performance of an FPGA is lower than that of other common accelerators such as GPUs. However, we consider that we can apply FPGAs to a wide variety of HPC applications if we can combine computations and communications on an FPGA. The purpose of this paper is to implement a parallel processing system running applications implemented by HLS that combine computations and communications in FPGAs. We propose the Channel over Ethernet (CoE) system, which connects multiple FPGAs directly for OpenCL parallel programming. "Channel" is one of the new extensions provided by the Intel OpenCL environment. Channels are ordinarily used for intra-kernel communication inside an FPGA, but we extend them to external communication through the CoE system. In this paper, we introduce two benchmarks as a demonstration of the CoE system. We achieved 29.77 Gbps in throughput (approximately 75% of the theoretical peak of 40 Gbps) and 950 ns in latency on our system using the pingpong benchmark, which was implemented on an Intel Arria 10 FPGA. In addition, we evaluated the Himeno benchmark, a 3D Computational Fluid Dynamics (CFD) benchmark, on the system, and we achieved 23,689 MFLOPS with 4 FPGAs on a problem of size M. We also observed strong scalability, with a 3.93 times speedup compared to a single-FPGA run on the same problem size.
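The ratios implied by the figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration derived from those quoted numbers: the variable names are hypothetical, and the "implied single-FPGA" value is an estimate obtained by dividing the 4-FPGA result by the reported speedup, not a measurement reported in the abstract.

```python
# Figures quoted in the abstract.
throughput_gbps = 29.77      # measured pingpong throughput
peak_gbps = 40.0             # theoretical peak of the link
speedup_4_fpgas = 3.93       # speedup over a single-FPGA run (size M)
num_fpgas = 4
mflops_4_fpgas = 23689.0     # Himeno benchmark result with 4 FPGAs

# Fraction of the theoretical link bandwidth actually achieved.
link_utilization = throughput_gbps / peak_gbps

# Strong-scaling parallel efficiency: speedup divided by device count.
parallel_efficiency = speedup_4_fpgas / num_fpgas

# Assumption: estimate of the single-FPGA rate, derived from the
# 4-FPGA result and the speedup; it is not stated in the abstract.
implied_single_fpga_mflops = mflops_4_fpgas / speedup_4_fpgas

print(f"link utilization:        {link_utilization:.4f}")
print(f"parallel efficiency:     {parallel_efficiency:.4f}")
print(f"implied 1-FPGA MFLOPS:   {implied_single_fpga_mflops:.1f}")
```

The link utilization comes out just under three quarters, consistent with the "approximately 75%" in the abstract, and the parallel efficiency is close to 1, which is what the abstract describes as strong scalability.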
Pages: 479-488
Number of pages: 10
Related papers
11 in total
  • [1] Baxter R, 2007, NASA/ESA CONFERENCE ON ADAPTIVE HARDWARE AND SYSTEMS, PROCEEDINGS, P287
  • [2] Fujita Norihisa, 2018, P 9 INT S HIGHL EFF
  • [3] Hill K, 2015, IEEE INT CONF ASAP, P189, DOI 10.1109/ASAP.2015.7245733
  • [4] Communication-overlap techniques for improved strong scaling of gyrokinetic Eulerian code beyond 100k cores on the K-computer
    Idomura, Yasuhiro
    Nakata, Motoki
    Yamada, Susumu
    Machida, Masahiko
    Imamura, Toshiyuki
    Watanabe, Tomohiko
    Nunami, Masanori
    Inoue, Hikaru
    Tsutsumi, Shigenobu
    Miyoshi, Ikuo
    Shida, Naoyuki
    [J]. INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2014, 28 (01) : 73 - 86
  • [5] OpenCL-based FPGA Design to Accelerate the Nodal Discontinuous Galerkin Method for Unstructured Meshes
    Kenter, Tobias
    Mahale, Gopinath
    Alhaddad, Samer
    Grynko, Yevgen
    Foerstner, Jens
    Plessl, Christian
    Schmitt, Christian
    Afzal, Ayesha
    Hannig, Frank
    [J]. PROCEEDINGS 26TH IEEE ANNUAL INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES (FCCM 2018), 2018, : 189 - 196
  • [6] OpenCL-ready High Speed FPGA Network for Reconfigurable High Performance Computing
    Kobayashi, Ryohei
    Oobata, Yuma
    Fujita, Norihisa
    Yamaguchi, Yoshiki
    Boku, Taisuke
    [J]. PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING IN ASIA-PACIFIC REGION (HPC ASIA 2018), 2018, : 192 - 201
  • [7] NVIDIA Corporation, GPUDIRECT RDMA
  • [8] Putnam A, 2014, CONF PROC INT SYMP C, P13, DOI 10.1109/ISCA.2014.6853195
  • [9] RIKEN, HIMENO BENCHMARK
  • [10] Multi-FPGA Accelerator for Scalable Stencil Computation with Constant Memory Bandwidth
    Sano, Kentaro
    Hatsuda, Yoshiaki
    Yamamoto, Satoru
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2014, 25 (03) : 695 - 705