A High-Throughput and Power-Efficient FPGA Implementation of YOLO CNN for Object Detection

Cited by: 277
Authors
Nguyen, Duy Thanh [1]
Nguyen, Tuan Nghia [1]
Kim, Hyun [2,3]
Lee, Hyuk-Jae [1]
Affiliations
[1] Seoul Natl Univ, Interuniv Semicond Res Ctr, Dept Elect & Comp Engn, Seoul 08826, South Korea
[2] Seoul Natl Univ Sci & Technol, Dept Elect & Informat Engn, Seoul 01811, South Korea
[3] Seoul Natl Univ Sci & Technol, Res Ctr Elect & Informat Technol, Seoul 01811, South Korea
Keywords
Binary weight; low-precision quantization; object detection; streaming architecture; you-only-look-once (YOLO);
DOI
10.1109/TVLSI.2019.2905242
CLC Number
TP3 [Computing technology, computer technology];
Discipline Code
0812;
Abstract
Convolutional neural networks (CNNs) require numerous computations and external memory accesses. Frequent accesses to off-chip memory cause slow processing and large power dissipation. For real-time object detection with high throughput and power efficiency, this paper presents a Tera-OPS streaming hardware accelerator implementing a you-only-look-once (YOLO) CNN. The parameters of the YOLO CNN are retrained and quantized with the PASCAL VOC data set using binary weights and flexible low-bit activations. The binary weights enable storing the entire network model in the block RAMs of a field-programmable gate array (FPGA), aggressively reducing off-chip accesses and thereby achieving a significant performance enhancement. In the proposed design, all convolutional layers are fully pipelined for enhanced hardware utilization. The input image is delivered to the accelerator line-by-line, and likewise the output from each layer is transmitted to the next layer line-by-line. The intermediate data are fully reused across layers, eliminating external memory accesses; the reduced dynamic random access memory (DRAM) traffic lowers DRAM power consumption. Furthermore, as the convolutional layers are fully parameterized, it is easy to scale up the network. In this streaming design, each convolutional layer is mapped to a dedicated hardware block, so the design outperforms "one-size-fits-all" accelerators in both performance and power efficiency. Implemented on a VC707 FPGA, the CNN achieves a throughput of 1.877 tera operations per second (TOPS) at 200 MHz with batch processing while consuming 18.29 W of on-chip power, the best power efficiency among previously reported designs. As for object detection accuracy, it achieves a mean average precision (mAP) of 64.16% on the PASCAL VOC 2007 data set, only 2.63% lower than the mAP of the same YOLO network with full precision.
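The abstract's key enabler is binary-weight quantization: replacing each filter's full-precision weights with a single sign bit (plus a per-filter scaling factor) shrinks the model enough to fit entirely in on-chip block RAM. The exact quantization scheme is not detailed in the abstract; the sketch below illustrates a common XNOR-Net-style approximation, where a filter W is approximated by alpha * sign(W) with alpha chosen as the mean absolute weight. The function name and the toy filter values are illustrative, not from the paper.

```python
import numpy as np

def binarize_weights(w):
    """Approximate a filter w by alpha * sign(w) (XNOR-Net-style),
    where alpha = mean(|w|) minimizes the L2 approximation error.
    Only the sign bits and one alpha per filter need to be stored."""
    alpha = np.abs(w).mean()
    return alpha * np.sign(w), alpha

# Toy 3x3 filter (illustrative values only)
w = np.array([[ 0.5, -0.2,  0.1],
              [-0.4,  0.3, -0.1],
              [ 0.2, -0.3,  0.4]])
wb, alpha = binarize_weights(w)
# Every binarized weight has magnitude alpha and keeps the original sign,
# so the 9 floats reduce to 9 sign bits plus one scalar.
```

Storage-wise, this is why the whole network fits on-chip: a 32-bit weight collapses to 1 bit, a 32x reduction before counting the shared scaling factors.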
Pages: 1861-1873 (13 pages)