Parallel Processing on FPGA Combining Computation and Communication in OpenCL Programming

Cited by: 10

Authors:

Fujita, Norihisa [1]

Kobayashi, Ryohei [1,2]

Yamaguchi, Yoshiki [1,2]

Boku, Taisuke [1,2]

Affiliations:

[1] Univ Tsukuba, Ctr Computat Sci, Tsukuba, Ibaraki, Japan

[2] Univ Tsukuba, Grad Sch Syst & Informat Engn, Tsukuba, Ibaraki, Japan

Source:

2019 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW) | 2019

Keywords:

FPGA; OpenCL; HLS; parallel computing; inter-connection

DOI:

10.1109/IPDPSW.2019.00089

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

In recent years, Field Programmable Gate Arrays (FPGAs) have been a topic of interest in High Performance Computing (HPC) research. The biggest obstacle to using FPGAs for HPC applications is the difficulty of developing for them, but this problem is being addressed by High Level Synthesis (HLS). We focus on very high-performance inter-FPGA communication capabilities. The absolute floating-point performance of an FPGA is lower than that of other common accelerators such as GPUs. However, we consider that FPGAs can be applied to a wide variety of HPC applications if computation and communication are combined on the FPGA. The purpose of this paper is to implement a parallel processing system that runs applications implemented with HLS and combines computation and communication in FPGAs. We propose the Channel over Ethernet (CoE) system, which connects multiple FPGAs directly for OpenCL parallel programming. "Channel" is one of the new extensions provided by the Intel OpenCL environment. Channels are ordinarily used for intra-kernel communication inside an FPGA, but we extend them to external communication through the CoE system. In this paper, we introduce two benchmarks as demonstrations of the CoE system. Using the pingpong benchmark, implemented on an Intel Arria 10 FPGA, we achieved a throughput of 29.77 Gbps (approximately 75% of the theoretical peak of 40 Gbps) and a latency of 950 ns. In addition, we evaluated the Himeno benchmark, a 3D Computational Fluid Dynamics (CFD) benchmark, on the system and achieved 23689 MFLOPS with 4 FPGAs on a problem of size M. We also observed strong scalability on the same problem size, with a 3.93 times speedup over a single-FPGA run.
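The ratios behind the figures quoted in the abstract can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the paper: the function names are hypothetical, and the single-FPGA performance is only an implied value derived from the reported 4-FPGA result and speedup, not a number stated in the abstract.

```python
# Figures quoted in the abstract.
THROUGHPUT_GBPS = 29.77       # measured pingpong throughput
PEAK_GBPS = 40.0              # theoretical peak of the link
HIMENO_MFLOPS_4FPGA = 23689   # Himeno benchmark, size M, 4 FPGAs
SPEEDUP_4FPGA = 3.93          # speedup relative to a single-FPGA run
NUM_FPGAS = 4


def link_utilization(measured_gbps, peak_gbps):
    """Fraction of the theoretical link bandwidth actually achieved."""
    return measured_gbps / peak_gbps


def parallel_efficiency(speedup, num_devices):
    """Strong-scaling efficiency: speedup divided by device count."""
    return speedup / num_devices


def implied_single_device_mflops(total_mflops, speedup):
    """Single-device performance implied by the multi-device result.

    Assumption: this is derived from the reported speedup, not a value
    given in the abstract.
    """
    return total_mflops / speedup


if __name__ == "__main__":
    util = link_utilization(THROUGHPUT_GBPS, PEAK_GBPS)
    eff = parallel_efficiency(SPEEDUP_4FPGA, NUM_FPGAS)
    single = implied_single_device_mflops(HIMENO_MFLOPS_4FPGA, SPEEDUP_4FPGA)
    print(f"link utilization:       {util:.1%}")
    print(f"parallel efficiency:    {eff:.1%}")
    print(f"implied 1-FPGA MFLOPS:  {single:.0f}")
```

The link utilization works out to about 74.4%, consistent with the "approximately 75%" in the abstract, and a 3.93 times speedup on 4 FPGAs corresponds to a parallel efficiency of about 98%.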

Citation

Pages: 479-488

Page count: 10
