Acceleration by Inline Cache for Memory-Intensive Algorithms on FPGA via High-Level Synthesis

Cited by: 12

Authors:

Ma, Liang [1]

Lavagno, Luciano [1]

Lazarescu, Mihai Teodor [1]

Arif, Arslan [1]

Affiliations:

[1] Politecn Torino, Dept Elect & Telecommun, I-10129 Turin, Italy

Source:

IEEE Access | 2017 / Vol. 5

Funding:

European Union Horizon 2020;

Keywords:

Cache; high-level synthesis; acceleration; FPGA; optimization; PERFORMANCE;

DOI:

10.1109/ACCESS.2017.2750923

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Using FPGA-based acceleration of high-performance computing (HPC) applications to reduce energy and power consumption is becoming an interesting option, thanks to the availability of high-level synthesis (HLS) tools that enable fast design cycles. However, obtaining good performance for memory-intensive algorithms, which often exchange large data arrays with external DRAM, still requires time-consuming optimization and good knowledge of hardware design. This article proposes a new design methodology, based on dedicated application- and data-array-specific caches. These caches provide most of the benefits that can be achieved by coding optimized DMA-like transfer strategies by hand into the HPC application code, but require only limited manual tuning (basically the selection of architecture and size), are neutral to target HLS tool and technology (FPGA or ASIC), and do not require changes to application code. We show experimental results obtained on five common memory-intensive algorithms from very diverse domains, namely machine learning, data sorting, and computer vision. We test the cost and performance of our caches against both out-of-the-box code originally optimized for a GPU, and manually optimized implementations specifically targeted for FPGAs via HLS. The implementations using our caches achieved an 8X speedup and 2X energy reduction on average with respect to out-of-the-box models using only simple directive-based optimizations (e.g., pipelining). They also achieved comparable performance with much less design effort when compared with the versions that were manually optimized to achieve efficient memory transfers specifically for an FPGA.
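The abstract's claim that a per-array cache can be added without changing application code can be illustrated with a small C++ sketch. This is not the authors' implementation: it assumes a read-only, direct-mapped cache reached through an overloaded operator[], and the names InlineCache, LINES, WORDS, and stencil_sum are hypothetical. It is plain C++ with no HLS directives, and it assumes the wrapped array length is a multiple of WORDS.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical read-only, direct-mapped cache bound to one external array.
// LINES (number of cache lines) and WORDS (words per line) stand in for the
// "architecture and size" tuning knobs; all names here are illustrative.
// Assumes the wrapped array length is a multiple of WORDS.
template <typename T, std::size_t LINES, std::size_t WORDS>
class InlineCache {
    static_assert(LINES > 0 && WORDS > 0, "cache must have at least one word");

public:
    explicit InlineCache(const T* dram) : dram_(dram) { assert(dram != nullptr); }

    // Overloading operator[] lets the kernel keep its plain array syntax.
    T operator[](std::size_t addr) {
        const std::size_t block = addr / WORDS;  // line-sized block in DRAM
        const std::size_t line = block % LINES;  // direct-mapped slot
        if (!valid_[line] || tag_[line] != block) {
            // Miss: fetch the whole line with one sequential (burst-like) read.
            for (std::size_t w = 0; w < WORDS; ++w)
                data_[line][w] = dram_[block * WORDS + w];
            tag_[line] = block;
            valid_[line] = true;
            ++misses_;
        } else {
            ++hits_;
        }
        return data_[line][addr % WORDS];
    }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    const T* dram_;
    T data_[LINES][WORDS] = {};
    std::size_t tag_[LINES] = {};
    bool valid_[LINES] = {};
    std::size_t hits_ = 0;
    std::size_t misses_ = 0;
};

// Kernel written once; it accepts a raw pointer or the cache wrapper.
template <typename Array>
long stencil_sum(Array& a, std::size_t n) {
    long acc = 0;
    for (std::size_t i = 1; i + 1 < n; ++i)
        acc += a[i - 1] + a[i] + a[i + 1];
    return acc;
}
```

Calling stencil_sum on an InlineCache<int, 4, 8> built over a 64-element array returns the same value as calling it on the raw pointer, while the kernel's 186 reads are served by 8 line fetches. A write-capable or set-associative variant would need dirty bits and a replacement policy, which this sketch leaves out.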


Pages: 18953-18974

Page count: 22
