Energy- and time-efficient matrix multiplication on FPGAs

Cited by: 47
Authors
Jang, JW [1]
Choi, SB
Prasanna, VK
Affiliations
[1] Sogang Univ, Dept Elect Engn, Seoul, South Korea
[2] Intel Corp, Chandler, AZ 85248 USA
[3] Univ So Calif, Dept Elect Engn Syst, Los Angeles, CA 90089 USA
Funding
US National Science Foundation
Keywords
algorithm design; configurable hardware; energy-delay tradeoff; field-programmable gate array (FPGA); linear array; matrix multiplication; performance estimation
DOI
10.1109/TVLSI.2005.859562
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
We develop new algorithms and architectures for matrix multiplication on configurable devices. These have reduced energy dissipation and latency compared with the state-of-the-art field-programmable gate array (FPGA)-based designs. By profiling well-known designs, we identify "energy hot spots," which are responsible for most of the energy dissipation. Based on this, we develop algorithms and architectures that offer tradeoffs among the number of I/O ports, the number of registers, and the number of PEs. To avoid time-consuming low-level simulations for energy profiling and performance prediction of many alternate designs, we derive functions to represent the impact of algorithm design choices on the system-wide energy dissipation, area, and latency. These functions are used to either optimize the energy performance or provide tradeoffs for a family of candidate algorithms and architectures. For selected designs, we perform extensive low-level simulations using state-of-the-art tools and target FPGA devices. We show a design space for matrix multiplication on FPGAs that results in tradeoffs among energy, area, and latency. For example, our designs improve the energy performance of state-of-the-art FPGA-based designs by 29%-51% without any increase in the area-latency product. The latency of our designs is reduced one-third to one-fifteenth while area is increased 1.9-9.4 times. In terms of comprehensive metrics such as Energy-Area-Time, our designs exhibit superior performance compared with the state-of-the-art by 50%-79%.
Pages: 1305-1319
Page count: 15