High-performance reduction circuits using deeply pipelined operators on FPGAs

Cited by: 46
Authors
Zhuo, Ling
Morris, Gerald R.
Prasanna, Viktor K.
Affiliations
[1] Univ So Calif, Dept Elect Engn Syst, Los Angeles, CA 90089 USA
[2] CEERD IH, Vicksburg, MS 39180 USA
Funding
U.S. National Science Foundation
Keywords
parallel algorithms; reconfigurable hardware;
DOI
10.1109/TPDS.2007.1068
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Field-programmable gate arrays (FPGAs) have become an attractive option for accelerating scientific applications. Many scientific operations, such as matrix-vector multiplication and dot product, involve the reduction of a sequentially produced stream of values. Unfortunately, because of the pipelining in FPGA-based floating-point units, data hazards may occur during these sequential reduction operations. Improperly designed reduction circuits can adversely impact performance, impose unrealistic buffer requirements, and consume a significant portion of the FPGA. In this paper, we identify two basic methods for designing serial reduction circuits: the tree-traversal method and the striding method. Using accumulation as an example, we analyze the design trade-offs among the number of adders, buffer size, and latency. We then propose high-performance and area-efficient designs using each method. The proposed designs reduce multiple sets of sequentially delivered floating-point values without stalling the pipeline or imposing unrealistic buffer requirements. Using a Xilinx Virtex-II Pro FPGA as the target device, we implemented our designs and present performance and area results.
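The data hazard described in the abstract can be illustrated with a small software model. The sketch below is not the paper's circuit; it is a minimal Python simulation of the general idea of strided accumulation, assuming one input arrives per cycle and the adder has a fixed pipeline depth. The function name `strided_accumulate` and the parameter `latency` are hypothetical names chosen for this illustration. Because a running sum is not available until `latency` cycles after an addition is issued, the model keeps `latency` interleaved partial sums so that each addition only reads a value that has already left the pipeline.

```python
from collections import deque


def strided_accumulate(values, latency):
    """Model a pipelined adder of depth `latency` fed one value per cycle.

    Illustrative assumption: input i is added to partial sum i % latency,
    so an addition never needs a result that is still in flight.
    """
    partials = [0.0] * latency           # one running sum per stride slot
    pipeline = deque([None] * latency)   # in-flight (slot, result) pairs

    for cycle, value in enumerate(values):
        # The addition issued `latency` cycles ago completes now.
        finished = pipeline.popleft()
        if finished is not None:
            slot, result = finished
            partials[slot] = result
        # Issue a new addition into the slot that has just been written back.
        slot = cycle % latency
        pipeline.append((slot, partials[slot] + value))

    # Drain the additions still in the pipeline after the last input.
    for finished in pipeline:
        if finished is not None:
            slot, result = finished
            partials[slot] = result

    # Combine the partial sums; in hardware this final step would itself
    # need a hazard-free schedule, which this sketch does not model.
    return sum(partials)
```

For example, `strided_accumulate([1, 2, 3, 4, 5], latency=3)` accumulates into three interleaved partial sums and then combines them. The cost of this approach is the extra storage for the partial sums and the final combining step, which relates to the trade-off among adders, buffer size, and latency that the abstract mentions.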
Pages: 1377-1392
Page count: 16