Design space exploration of multi-core RTL via high level synthesis from OpenCL models

Cited by: 2

Authors:

Roozmeh, Mehdi [1]

Lavagno, Luciano [2]

Affiliations:

[1] Politecn Torino, Elect Engn, Turin, Italy

[2] Politecn Torino, Turin, Italy

Source:

MICROPROCESSORS AND MICROSYSTEMS | 2018 / Volume 63

Keywords:

Design space exploration; Data center; FPGA; GPU; OpenCL; High-level synthesis; Low-power low-energy computations; Parallel computing;

DOI:

10.1016/j.micpro.2018.09.009

Chinese Library Classification number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

As more and more powerful integrated circuits are appearing on the market, more and more applications, with very different requirements and workloads, are making use of available computing power. Designing optimized accelerators that can meet particular requirements has always presented a tremendous challenge to hardware engineers. To do so, designers have to trade off performance for power consumption in a manner such that the final RTL consumes minimum energy to meet the required performance (e.g. FLOPS) target. Moreover, the growing trend towards heterogeneous platforms is crucial to meet time and power consumption constraints of high-performance computing (HPC) applications. The OpenCL parallel programming language and framework enables programming CPU, GPU and recently FPGAs using the high-level synthesis (HLS) methodology. This work presents a design space exploration flow based on execution time, resource utilization and power consumption of OpenCL kernels mapped on FPGAs using the Xilinx high-level synthesis tool chain. Our experiments suggest that the quality of generated solutions, in terms of performance-per-watt, can be determined using analytical formulas prior to implementation, thus enabling fast and accurate DSE by considering on-chip and off-chip sources of parallelism. Moreover, the automated flow suggests design hints to meet a given time constraint within available resources. The proposed technique is demonstrated by optimizing the well known bitonic sorting network from NVIDIA's OpenCL benchmark. Our results report that FPGAs have at least 20% higher performance-per-watt with respect to two high-end CPUs manufactured in the same technology (28 nm). Additionally, FPGAs with more available resources and using a more modern process (20 nm) can outperform the tested GPUs while consuming at least 55% less power at the cost of more expensive devices. (C) 2018 Published by Elsevier B.V.


Pages: 199-208

Page count: 10
