FPGA-Based High-Throughput CNN Hardware Accelerator With High Computing Resource Utilization Ratio

Cited by: 57
Authors
Huang, Wenjin [1]
Wu, Huangtao [1]
Chen, Qingkun [1]
Luo, Conghui [1]
Zeng, Shihao [1]
Li, Tianrui [1]
Huang, Yihua [1, 2]
Affiliations
[1] Sun Yat Sen Univ, Sch Elect & Informat Technol, Guangzhou 510006, Peoples R China
[2] Southern Marine Sci & Engn Guangdong Lab, Zhuhai 519080, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Computer architecture; Throughput; Hardware; Field programmable gate arrays; Resource management; Convolution; Clocks; Convolutional neural network; FPGA-based accelerator architecture; highly efficient storage system; high resource utilization ratio;
DOI
10.1109/TNNLS.2021.3055814
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Field-programmable gate array (FPGA)-based CNN hardware accelerators that adopt a single-computing-engine (CE) or multi-CE architecture have attracted great attention in recent years. Their actual throughput keeps rising but remains far below the theoretical throughput, owing to problems such as inefficient computing resource mapping and insufficient data supply. To solve these problems, this article proposes a novel composite hardware CNN accelerator architecture. To perform the convolution layer (CL) efficiently, a novel multi-CE architecture based on a row-level pipelined streaming strategy is proposed. For each CE, an optimized mapping mechanism improves the computing resource utilization ratio, and an efficient data system with continuous data supply keeps the CE from idling. In addition, a weight data allocation strategy is proposed to relieve the stress on off-chip bandwidth. To perform the fully connected layer (FCL), a single-CE architecture based on a batch-based computing method is proposed. Based on these design methods and strategies, visual geometry group network-16 (VGG-16) and ResNet-101 are both implemented on the XC7VX980T FPGA platform. The VGG-16 accelerator consumed 3395 multipliers and achieved a throughput of 1 TOPS at 150 MHz, about 98.15% of the theoretical throughput (2 × 3395 × 150 MOPS). Similarly, the ResNet-101 accelerator achieved 600 GOPS at 100 MHz, about 96.12% of the theoretical throughput (2 × 3121 × 100 MOPS).
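The throughput figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration, not code from the paper: the function names are hypothetical, and it assumes the abstract's convention that each multiplier contributes two operations per clock cycle (one multiply and one accumulate).

```python
def theoretical_gops(multipliers: int, clock_mhz: float) -> float:
    """Peak throughput in GOPS, assuming 2 operations per multiplier per cycle."""
    # 2 * multipliers * clock_mhz is in MOPS; divide by 1000 to get GOPS.
    return 2 * multipliers * clock_mhz / 1000.0


def achieved_gops(multipliers: int, clock_mhz: float, utilization: float) -> float:
    """Throughput implied by a given fraction of the theoretical peak."""
    return theoretical_gops(multipliers, clock_mhz) * utilization


# Figures taken from the abstract: multipliers, clock (MHz), utilization ratio.
vgg16_peak = theoretical_gops(3395, 150)
vgg16_actual = achieved_gops(3395, 150, 0.9815)
resnet101_peak = theoretical_gops(3121, 100)
resnet101_actual = achieved_gops(3121, 100, 0.9612)

print(f"VGG-16:     peak {vgg16_peak:.1f} GOPS, achieved about {vgg16_actual:.0f} GOPS")
print(f"ResNet-101: peak {resnet101_peak:.1f} GOPS, achieved about {resnet101_actual:.0f} GOPS")
```

Under these assumptions the VGG-16 peak is 1018.5 GOPS and the ResNet-101 peak is 624.2 GOPS, so the stated utilization ratios are consistent with the reported 1 TOPS and 600 GOPS.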
Pages: 4069-4083
Page count: 15