High-Performance FPGA-Based CNN Accelerator With Block-Floating-Point Arithmetic

Cited by: 111
Authors
Lian, Xiaocong [1 ]
Liu, Zhenyu [2 ]
Song, Zhourui [3 ]
Dai, Jiwu [4 ]
Zhou, Wei [4 ]
Ji, Xiangyang [1 ]
Affiliations
[1] Tsinghua Univ, Dept Automat, Beijing 100084, Peoples R China
[2] Tsinghua Univ, Tsinghua Natl Lab Informat Sci & Technol, Res Inst Informat Technol, Beijing 100084, Peoples R China
[3] Beijing Univ Posts & Telecommun BUPT, Sch Cyberspace Secur, Beijing 100876, Peoples R China
[4] Northwestern Polytech Univ, Sch Elect & Informat, Xian 710129, Shaanxi, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Block floating point (BFP); convolutional neural network (CNN) accelerator; field-programmable gate array (FPGA); three-level parallel;
DOI
10.1109/TVLSI.2019.2913958
CLC Number
TP3 [computing technology, computer technology];
Discipline Code
0812 ;
Abstract
Convolutional neural networks (CNNs) are widely used and have achieved great success in computer vision and speech processing applications. However, deploying large-scale CNN models on embedded systems is constrained by limited computation and memory resources. In this paper, an optimized block-floating-point (BFP) arithmetic is adopted in our accelerator for efficient inference of deep neural networks. The feature maps and model parameters are represented in 16-bit and 8-bit formats, respectively, in the off-chip memory, reducing memory and off-chip bandwidth requirements by 50% and 75% compared with the 32-bit floating-point counterpart. The proposed 8-bit BFP arithmetic, with optimized rounding and a shifting-operation-based quantization scheme, improves energy and hardware efficiency by three times. A CNN model can be deployed in our accelerator without retraining, at the cost of an accuracy loss of no more than 0.12%. The proposed reconfigurable accelerator, with three parallelism dimensions, ping-pong off-chip DDR3 memory access, and an optimized on-chip buffer group, is implemented on the Xilinx VC709 evaluation board. Our accelerator achieves 760.83 GOP/s and 82.88 GOP/s/W at a 200-MHz working frequency, significantly outperforming previous accelerators.
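The BFP representation the abstract summarizes groups values into blocks that share a single exponent while each value keeps a short signed mantissa, so multiply-accumulate hardware can operate on fixed-point mantissas. A minimal NumPy sketch of that idea under assumed choices (round-to-nearest, per-block max-based exponent; function names are illustrative, not the paper's exact hardware scheme):

```python
import numpy as np

def bfp_quantize(block, mantissa_bits=8):
    """Quantize a block of floats to block-floating-point form:
    all values share one exponent (set by the largest magnitude),
    and each value keeps a signed `mantissa_bits`-bit mantissa."""
    max_abs = np.max(np.abs(block))
    if max_abs == 0:
        return np.zeros(block.shape, dtype=np.int32), 0
    # Shared exponent: smallest e such that max_abs / 2**e < 1
    shared_exp = int(np.floor(np.log2(max_abs))) + 1
    scale = 2.0 ** (mantissa_bits - 1 - shared_exp)
    # Round-to-nearest mantissas, clipped to the signed n-bit range
    mantissas = np.clip(np.round(block * scale),
                        -(2 ** (mantissa_bits - 1)),
                        2 ** (mantissa_bits - 1) - 1).astype(np.int32)
    return mantissas, shared_exp

def bfp_dequantize(mantissas, shared_exp, mantissa_bits=8):
    """Recover approximate float values from BFP mantissas."""
    return mantissas.astype(np.float64) * 2.0 ** (shared_exp - (mantissa_bits - 1))
```

With 8-bit mantissas the worst-case quantization error within a block is half a mantissa step, which is the kind of bound that keeps the end-to-end accuracy loss small without retraining.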
Pages: 1874-1885
Page count: 12