Array-Specific Dataflow Caches for High-Level Synthesis of Memory-Intensive Algorithms on FPGAs

Cited by: 4
Authors
Brignone, Giovanni [1 ]
Jamal, M. Usman [1 ]
Lazarescu, Mihai T. [1 ]
Lavagno, Luciano [1 ]
Affiliations
[1] Politecn Torino, Dept Elect & Telecommun, I-10129 Turin, Italy
Keywords
Cache; FPGA; high-level synthesis; memory management;
DOI
10.1109/ACCESS.2022.3219868
CLC (Chinese Library Classification) number
TP [Automation technology; computer technology]
Subject classification number
0812
Abstract
Designs implemented on field-programmable gate arrays (FPGAs) via high-level synthesis (HLS) suffer from off-chip memory latency and bandwidth bottlenecks. FPGAs can access both large but slow off-chip memories (DRAM) and fast but small on-chip memories (block RAMs and registers). HLS tools allow exploiting the memory hierarchy in a scratchpad-like fashion, but this requires significant manual effort. We propose an automation of FPGA memory management in Xilinx Vitis HLS through a fully configurable C++ source-level cache. Each DRAM-mapped array can be associated with a private level 2 (L2) cache with one or more ports, and each port can optionally provide a level 1 (L1) cache. The L2 cache runs in a dataflow task separate from the application accessing it. This solution isolates off-chip memory accesses and data buffering into dedicated dataflow tasks, resembling the load, compute, store design paradigm but without the drawback of manual algorithm refactoring. Experimental results collected on an FPGA board show that our cache speeds up the execution of a variety of benchmarks by up to 60 times compared to the out-of-the-box solution provided by HLS, while requiring very limited optimization effort. Our caches are not meant to compete with the quality of results (QoR) of manually optimized implementations, but rather to trade some QoR for significant savings in design effort, making the HLS flow more software-like and allowing the designer to focus on algorithmic optimizations rather than on explicit memory management. Moreover, caching can be the only feasible memory optimization for algorithms with data-dependent or irregular memory access patterns that nonetheless exhibit good data locality.
Pages: 118858-118877
Number of pages: 20