Logarithm-approximate floating-point multiplier is applicable to power-efficient neural network training

Cited by: 13
Authors
Cheng, TaiYu [1 ]
Masuda, Yutaka [2 ]
Chen, Jun [1 ]
Yu, Jaehoon [3 ]
Hashimoto, Masanori [1 ]
Affiliations
[1] Osaka Univ, Dept Informat Syst Engn, Suita, Osaka, Japan
[2] Nagoya Univ, Grad Sch Informat, Ctr Embedded Comp Syst, Nagoya, Aichi, Japan
[3] Tokyo Inst Technol, Inst Innovat Res, Tokyo, Japan
Keywords
Approximate computing; Neural network; Training engine; Floating-point unit; Logarithm multiplier; GPU design; EDGE;
DOI
10.1016/j.vlsi.2020.05.002
CLC Classification
TP3 [Computing Technology and Computer Technology];
Discipline Code
0812;
Abstract
The emerging paradigm of edge computing moves data and services from the cloud to nearby edge servers to achieve low latency and wide bandwidth and to address privacy concerns. However, edge servers, which often embed GPU processors, demand power-efficient neural network (NN) training because of their power and size limitations. Moreover, since the gradient values computed during NN training span a broad dynamic range, floating-point representation is better suited than fixed-point. This paper proposes adopting a logarithm-approximate multiplier (LAM) for the multiply-accumulate (MAC) computation in NN training engines, where LAM approximates a floating-point multiplication as a fixed-point addition, resulting in smaller delay, fewer gates, and lower power consumption. We demonstrate the efficiency of LAM on two platforms: dedicated NN training hardware and an open-source GPU design. Compared with NN training using the exact multiplier, our implementation of the NN training engine for a 2-D classification dataset achieves a 10% speed-up and a 2.3X improvement in power and area efficiency. LAM is also highly compatible with conventional bit-width scaling (BWS). When BWS is applied together with LAM on five test datasets, the implemented training engines achieve more than 4.9X power efficiency improvement with at most 1% accuracy degradation, of which a 2.2X improvement originates from LAM. The advantage of LAM can also be exploited in processors: a GPU design embedded with LAM, implemented in an FPGA and executing an NN-training workload, delivers a 1.32X power efficiency improvement, and the improvement reaches 1.54X with LAM + BWS. Finally, LAM-based training is evaluated in deeper NNs. Up to a 4-hidden-layer NN, LAM-based training achieves accuracy highly comparable to that of the exact multiplier, even with aggressive BWS.
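For context, the sketch below illustrates the general Mitchell-style logarithm approximation that underlies a LAM, in which a floating-point multiplication is replaced by a fixed-point addition of the operands' bit patterns. It is a minimal software illustration of the technique, not the authors' hardware design; the function name lam_mul, the float32-only scope, and the omission of zeros, subnormals, infinities, and NaNs are assumptions made for brevity.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Minimal sketch of a Mitchell-style logarithm-approximate multiplier.
 * Adding the raw IEEE-754 bit patterns adds the exponents and
 * (approximately) the log2 of the mantissas; subtracting the constant
 * 0x3F800000 removes the exponent bias that was counted twice.
 * The sign is handled exactly; special values are not handled here. */
static float lam_mul(float a, float b)
{
    uint32_t ia, ib, ir;
    float r;

    memcpy(&ia, &a, sizeof ia);   /* reinterpret float bits as integers */
    memcpy(&ib, &b, sizeof ib);

    uint32_t sign = (ia ^ ib) & 0x80000000u;          /* exact product sign          */
    ir = (ia & 0x7FFFFFFFu) + (ib & 0x7FFFFFFFu)      /* add magnitude bit patterns  */
         - 0x3F800000u;                               /* subtract the duplicated bias */
    ir = (ir & 0x7FFFFFFFu) | sign;

    memcpy(&r, &ir, sizeof r);
    return r;
}

int main(void)
{
    float a = 1.75f, b = 2.5f;
    /* Exact product is 4.375; the approximation yields 4.0 here.
     * Mitchell's approximation has a worst-case error of about 11%. */
    printf("exact = %f, approx = %f\n", a * b, lam_mul(a, b));
    return 0;
}
```

In hardware, the same addition acts directly on the concatenated exponent and mantissa fields, which is why a LAM needs only an adder in place of a mantissa multiplier, giving the delay, gate-count, and power savings described in the abstract.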
Pages: 19-31
Page count: 13