Efficient deployment of Single Shot Multibox Detector network on FPGAs

Cited by: 1

Authors:

Qian, Wei [1]

Zhu, Zhengwei [1]

Zhu, Chenyang [2]

Luo, Weibin [2]

Zhu, Yanping [1]

Affiliations:

[1] Changzhou Univ, Sch Microelect & Control Engn, Changzhou 213146, Peoples R China

[2] Changzhou Univ, Sch Comp Sci & Artificial Intelligence, Changzhou 213146, Peoples R China

Source:

INTEGRATION-THE VLSI JOURNAL | 2024 / Vol. 99

Keywords:

SSD algorithm; FPGA; Object detection; Hardware acceleration; HLS; PYNQ

DOI:

10.1016/j.vlsi.2024.102255

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

FPGAs, characterized by low power consumption and fast response, are well suited to the parallel computations involved in object detection, making them a popular choice for target detection and neural network acceleration. However, contemporary FPGA designs often come with high costs and resource demands, which limit their adoption in resource-constrained embedded and edge devices. This study presents a novel design that addresses these limitations by emphasizing cost-effectiveness, energy efficiency, and speed, specifically for the Single Shot MultiBox Detector (SSD). The design uses a Xilinx ZYNQ7020-based main control chip and applies parallel computing models to the convolution layers and feature extraction. It improves efficiency by proposing parallel feature extraction at the network architecture level and by integrating convolution, activation, and pooling into a single, hardware-optimized operation for convolution kernel computations. To optimize memory management, the design alternately reuses memory for feature layer inputs and outputs, thereby reducing read/write delays and transmission times. Implemented on a PYNQ-Z2 development board and tested using Jupyter Notebook, the SSD algorithm demonstrates 789.4 GOPS inference performance with 16-bit fixed-point quantization at a 200 MHz clock frequency, achieving an average accuracy of 77.84% and an inference time of 81.4621 ms while consuming 1.595 W of power. The design boosts energy efficiency by up to 2590% compared with contemporary methods.
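The alternating memory reuse mentioned in the abstract can be illustrated with a small software analogy. The sketch below is not the authors' HLS implementation; it only assumes that two fixed buffers swap the roles of layer input and layer output after every layer, so no new feature-map storage is allocated between layers. The function names (`relu`, `max_pool2`, `run_ping_pong`) and the one-dimensional toy layers are hypothetical and chosen for illustration only.

```python
def relu(src, n, dst):
    """Toy activation layer: writes max(x, 0) for the first n values of src into dst."""
    for i in range(n):
        dst[i] = max(src[i], 0)
    return n  # number of valid output values


def max_pool2(src, n, dst):
    """Toy 1-D max pooling with window 2 and stride 2 (an odd trailing value is dropped)."""
    m = n // 2
    for i in range(m):
        dst[i] = max(src[2 * i], src[2 * i + 1])
    return m


def run_ping_pong(layers, x, buf_size):
    """Run layers in sequence using only two preallocated buffers.

    Each layer reads from one buffer and writes to the other; the roles
    then swap, so the output of layer k becomes the input of layer k + 1
    without copying or allocating a new feature map.
    """
    if len(x) > buf_size:
        raise ValueError("input does not fit in the buffer")
    buffers = [[0] * buf_size, [0] * buf_size]
    n = len(x)
    buffers[0][:n] = x          # load the input into buffer 0
    src = 0
    for layer in layers:
        dst = 1 - src           # the other buffer receives the output
        n = layer(buffers[src], n, buffers[dst])
        src = dst               # swap roles for the next layer
    return buffers[src][:n]


result = run_ping_pong([relu, max_pool2], [-1, 3, 2, -5], buf_size=4)
print(result)  # [3, 2]
```

On an FPGA the two buffers would typically be fixed on-chip or DDR regions sized for the largest feature map, which is an assumption here rather than a detail stated in the abstract.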


Pages: 11

