FPQNet: Fully Pipelined and Quantized CNN for Ultra-Low Latency Image Classification on FPGAs Using OpenCAPI

Cited by: 2
Authors
Ji, Mengfei [1 ,2 ]
Al-Ars, Zaid [2 ]
Hofstee, Peter [2 ]
Chang, Yuchun [3 ]
Zhang, Baolin [1 ]
Affiliations
[1] Jilin Univ, Coll Elect Sci & Engn, State Key Lab Integrated Optoelect, Changchun 130012, Peoples R China
[2] Delft Univ Technol, Dept Quantum & Comp Engn, NL-2628 CD Delft, Netherlands
[3] Dalian Univ Technol, Sch Microelect, Dalian 116620, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
CNNs; FPGA acceleration; HDMI; OpenCAPI; layer pipeline; channel parallelization; NEURAL-NETWORKS;
DOI
10.3390/electronics12194085
Chinese Library Classification (CLC)
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Convolutional neural networks (CNNs) have proven to be effective in many application domains, especially in the computer vision area. In order to achieve lower-latency CNN processing and reduce power consumption, developers are experimenting with FPGAs to accelerate CNN processing in several applications. Current FPGA CNN accelerators usually use the same acceleration approaches as GPUs, where operations from different network layers are mapped to the same hardware units working in a multiplexed manner. This results in high flexibility for implementing different types of CNNs; however, it degrades the latency that accelerators can achieve. Alternatively, we can reduce the latency of the accelerator by pipelining the processing of consecutive layers, at the expense of more FPGA resources. The continued increase in hardware resources available in FPGAs makes such implementations feasible for latency-critical application domains. In this paper, we present FPQNet, a fully pipelined and quantized CNN FPGA implementation that is channel-parallel, layer-pipelined, and network-parallel, to decrease latency and increase throughput, combined with quantization methods to optimize hardware utilization. In addition, we optimize this hardware architecture for the HDMI timing standard to avoid extra hardware utilization, which makes it possible for the accelerator to handle video datasets. We present prototypes of the FPQNet CNN network implementations on an Alpha Data 9H7 FPGA, connected with an OpenCAPI interface, to demonstrate the architecture's capabilities. Results show that with a 250 MHz clock frequency, an optimized LeNet-5 design is able to achieve latencies as low as 9.32 μs with an accuracy of 98.8% on the MNIST dataset, making it feasible for use in high-frame-rate video processing applications. With 10 hardware kernels working concurrently, the throughput is as high as 1108 GOPs. The methods in this paper are suitable for many other CNNs. Our analysis shows that the latencies of AlexNet, ZFNet, OverFeat-Fast, and OverFeat-Accurate can be as low as 69.27, 66.95, 182.98, and 132.6 μs, respectively, using the architecture introduced in this paper.
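As a rough plausibility check (not spelled out in the abstract itself), the reported figures are mutually consistent: assuming each of the 10 concurrent kernels completes about one inference per 9.32 μs latency period (a conservative assumption, since the fully pipelined design may sustain a higher per-kernel rate), the 1108 GOPs aggregate throughput implies roughly 1 MOP per optimized LeNet-5 inference. A minimal Python sketch of this arithmetic, using only the numbers quoted above:

# Back-of-the-envelope check of the abstract's figures.
# Assumption (not from the paper): each kernel finishes about one
# inference per latency period; the pipelined design may sustain more.
latency_s = 9.32e-6        # reported LeNet-5 latency
kernels = 10               # reported number of concurrent hardware kernels
throughput_ops = 1108e9    # reported aggregate throughput (1108 GOPs)

inferences_per_s = kernels / latency_s                  # ~1.07e6 inferences/s
ops_per_inference = throughput_ops / inferences_per_s   # ~1.03e6 ops
print(f"~{inferences_per_s:.3g} inferences/s, "
      f"~{ops_per_inference / 1e6:.2f} MOPs per inference")
# Prints roughly 1.07e+06 inferences/s and ~1.03 MOPs per inference,
# which is in the right range for a small, quantization-optimized
# LeNet-5 variant on 28x28 MNIST inputs.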
Pages: 19