A Deep Learning Accelerator Based on a Streaming Architecture for Binary Neural Networks

Cited by: 10
Authors
Vo, Quang Hieu [1 ]
Le, Ngoc Linh [1 ]
Asim, Faaiz [1 ]
Kim, Lok-Won [1 ]
Hong, Choong Seon [1 ]
Affiliations
[1] Kyung Hee Univ, Dept Comp Sci & Engn, Yongin 17104, South Korea
Funding
National Research Foundation of Singapore
关键词
Computer architecture; Hardware; Throughput; Neural networks; Computational modeling; Optimization; Logic gates; Binary neural networks; deep learning accelerators; FPGAs;
DOI
10.1109/ACCESS.2022.3151916
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Deep neural networks (DNNs) play an increasingly important role in areas such as computer vision and voice recognition. While training and validation have gradually become feasible on high-end general-purpose processors such as graphics processing units (GPUs), high-throughput inference on embedded hardware platforms with limited resources and tight power budgets remains challenging. Binarized neural networks (BNNs) are emerging as a promising way to overcome these challenges by reducing the bit widths of DNN data representations, and many prior optimizations build on them. However, accuracy degradation remains a considerable problem for BNNs compared with full-precision counterparts, while binarized networks still contain significant redundancy that can be exploited for optimization. In this paper, to address these limitations, we implement a streaming accelerator architecture with three optimization techniques: pipelining-unrolling to stream each layer, weight reuse for parallel computation, and MAC (multiply-accumulate) compression. Our method first constructs the streaming architecture with the pipelining-unrolling method to maximize throughput. Next, the weight reuse method, based on K-means clustering, is applied to reduce the complexity of the popcount operation. Finally, MAC compression reduces the hardware resources consumed by the remaining MAC computations. The hardware accelerator, implemented on a state-of-the-art field-programmable gate array (FPGA), achieves peak classification performance of 1531K frames per second at 98.4% accuracy on the MNIST dataset and 205K frames per second at 80.2% accuracy on the CIFAR-10 dataset. In addition, the proposed design's FPS/LUT ratio is approximately 57 (MNIST) and 0.707 (CIFAR-10), much higher than that of state-of-the-art designs with comparable throughput and inference accuracy.
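The core arithmetic behind two of the abstract's techniques, popcount-based binary MACs and weight reuse across clustered weight vectors, can be sketched in a few lines. The following is a minimal NumPy illustration of the idea only, not the paper's FPGA datapath; the function names (binary_dot, layer_with_weight_reuse) and the hand-picked two-cluster demo are our own assumptions, whereas the paper derives the clusters with K-means and realizes the logic in hardware.

```python
import numpy as np

def binary_dot(x_bits: np.ndarray, w_bits: np.ndarray) -> int:
    """Binary dot product via XNOR + popcount.

    With {-1, +1} values encoded as {0, 1} bits,
    dot(x, w) = 2 * popcount(XNOR(x, w)) - N.
    """
    matches = (~(x_bits ^ w_bits)) & 1      # 1 where the bits agree (XNOR)
    return 2 * int(matches.sum()) - x_bits.size

def layer_with_weight_reuse(x_bits, w_rows, centroids, assign):
    """Evaluate all output neurons with one full popcount per cluster.

    Each weight row is treated as its cluster centroid plus a sparse
    correction at the few bit positions where it differs, so the wide
    popcount is shared instead of recomputed per row.
    """
    base = [binary_dot(x_bits, c) for c in centroids]   # expensive part, once per cluster
    out = np.empty(len(w_rows), dtype=int)
    for i, w in enumerate(w_rows):
        diff = np.nonzero(w ^ centroids[assign[i]])[0]  # bits where row != centroid
        # Flipping one weight bit moves the +/-1 dot product by exactly 2.
        delta = sum(2 if x_bits[j] == w[j] else -2 for j in diff)
        out[i] = base[assign[i]] + delta
    return out

# Demo: the corrected result matches the naive per-row popcount exactly.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, 64)
W = rng.integers(0, 2, (8, 64))
assign = np.array([0, 0, 0, 0, 1, 1, 1, 1])             # hand-picked clusters for the demo
centroids = np.stack([W[:4].mean(0).round().astype(int),
                      W[4:].mean(0).round().astype(int)])
assert np.array_equal(layer_with_weight_reuse(x, W, centroids, assign),
                      np.array([binary_dot(x, w) for w in W]))
```

The hardware payoff of this reformulation is that one wide popcount tree per cluster can be instantiated once, with per-row corrections touching only the handful of differing bit positions, which is what lets the design trade LUTs for throughput.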
Pages: 21141-21159
Page count: 19