Multilevel Granularity Parallelism Synthesis on FPGAs

Cited by: 28
Authors
Papakonstantinou, Alexandros [1]
Liang, Yun [2]
Stratton, John A. [1]
Gururaj, Karthik [3]
Chen, Deming [1]
Hwu, Wen-Mei W. [1]
Cong, Jason [3]
Affiliations
[1] Univ Illinois, Elect & Comp Eng Dept, Urbana, IL 61801 USA
[2] Adv Digital Sci Ctr, Singapore, Singapore
[3] Univ Calif Los Angeles, Dept Comp Sci, Los Angeles, CA USA
Source
2011 IEEE 19TH ANNUAL INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES (FCCM) | 2011
Keywords
FPGA; High-Level Synthesis; Parallel Computing; Design Space Exploration
DOI
10.1109/FCCM.2011.29
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Recent progress in High-Level Synthesis (HLS) techniques has helped raise the abstraction level of FPGA programming. However, implementation and performance evaluation of the HLS-generated RTL involves lengthy logic synthesis and physical design flows. Moreover, mapping different levels of coarse-grained parallelism onto hardware spatial parallelism affects the final FPGA-based performance in terms of both cycles and frequency. Evaluation of the rich design space through the full implementation flow, starting with high-level source code and ending with a routed netlist, is prohibitive in various scientific and computing domains, thus hindering the adoption of reconfigurable computing. This work presents a framework for multilevel granularity parallelism exploration with HLS-order of efficiency. Our framework considers different granularities of parallelism for mapping CUDA kernels onto high-performance FPGA-based accelerators. We leverage resource and clock period models to estimate the impact of multi-granularity parallelism extraction on execution cycles and frequency. The proposed Multilevel Granularity Parallelism Synthesis (ML-GPS) framework employs an efficient design space search heuristic in tandem with the estimation models, as well as design layout information, to derive a configuration with near-optimal performance. Our experimental results demonstrate that ML-GPS can efficiently identify and generate CUDA kernel configurations that significantly outperform previous related tools, while offering competitive performance compared to software kernel execution on GPUs at a fraction of the energy cost.
Pages: 178-185
Page count: 8