High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System

Cited by: 4
Authors
Hu, Xianghong [1]
Huang, Hongmin [1]
Li, Xueming [1]
Zheng, Xin [1]
Ren, Qinyuan [2]
He, Jingyu [3]
Xiong, Xiaoming [1]
Affiliations
[1] Guangdong Univ Technol, Sch Microelectron, Guangzhou 510006, Guangdong, Peoples R China
[2] Zhejiang Univ, Coll Control Sci & Engn, Hangzhou, Peoples R China
[3] Hong Kong Univ Sci & Technol, Dept Elect & Comp Engn, Hong Kong 999077, Peoples R China
Keywords
Convolutional neural networks; reconfigurable; accelerator; real-time object detection system; design space exploration; neural network; hardware accelerator
DOI
10.1145/3530818
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Deep convolutional neural networks (DNNs) have been widely used in many applications, particularly in machine vision. Accelerating DNNs on embedded systems is challenging because real-world machine vision applications must reserve much of the external memory bandwidth for other tasks, such as video capture and display, leaving little bandwidth for the DNN accelerator. To address this issue, we propose a high-throughput accelerator for bandwidth-limited systems, called the reconfigurable tiny neural network accelerator (ReTiNNA), and present a real-time object detection system for high-resolution video. We first present a dedicated computation engine that uses different data-mapping methods for different filter types to improve data reuse and reduce hardware resources. We then propose an adaptive layer-wise tiling strategy that tiles the feature maps into strips, which dramatically reduces the control complexity of data transmission and improves its efficiency. Finally, we present a design space exploration (DSE) approach that explores the design space more accurately when bandwidth is insufficient, improving the performance of the low-bandwidth accelerator. With a low bandwidth of 2.23 GB/s and a low hardware consumption of 90.261K LUTs and 448 DSPs, ReTiNNA still achieves 155.86 GOPS on VGG16 and 68.20 GOPS on ResNet50, which is better than other state-of-the-art designs implemented on FPGA devices. Furthermore, the real-time object detection system achieves an object detection speed of 19 fps for high-resolution video.
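The abstract does not give the details of the strip tiling, so the following is only a rough illustration of the general idea of cutting a feature map into horizontal strips, not the authors' adaptive strategy. It assumes an unpadded convolution with a square kernel; the function name `strip_tiles` and all parameters are hypothetical and chosen for this sketch.

```python
from typing import List, Tuple


def strip_tiles(
    in_height: int,
    out_rows_per_strip: int,
    kernel: int = 3,
    stride: int = 1,
) -> List[Tuple[int, int, int, int]]:
    """Split a feature map into horizontal strips (illustrative assumption only).

    Returns one tuple per strip: (out_start, out_end, in_start, in_end),
    with half-open row ranges. Assumes an unpadded ("valid") convolution.
    """
    # Number of output rows produced by an unpadded convolution.
    out_height = (in_height - kernel) // stride + 1
    tiles = []
    for out_start in range(0, out_height, out_rows_per_strip):
        out_end = min(out_start + out_rows_per_strip, out_height)
        # Input rows needed to compute output rows [out_start, out_end).
        in_start = out_start * stride
        in_end = (out_end - 1) * stride + kernel
        tiles.append((out_start, out_end, in_start, in_end))
    return tiles


if __name__ == "__main__":
    for tile in strip_tiles(in_height=10, out_rows_per_strip=4):
        print(tile)
```

In this sketch each strip spans the full width of the feature map, so a strip is one contiguous block of rows in a row-major layout, and neighbouring strips overlap by `kernel - stride` input rows.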
Pages: 20