An FPGA-Based Microinstruction Sequence Driven Spaceborne Convolution Neural Network Accelerator

Cited by: 0
Authors
Guo Z.-B. [1]
Liu K. [1]
Hu H.-T. [1]
Li Y.-D. [1]
Qu Z.-X. [2]
Affiliations
[1] School of Computer Science and Technology, Xidian University, Xi'an
[2] CAST-Xi'an Institute of Space Radio Technology, Xi'an
Source
Jisuanji Xuebao/Chinese Journal of Computers | 2022 / Vol. 45 / No. 10
Funding
National Natural Science Foundation of China
Keywords
CNN; FPGA; Microinstruction sequences; Microprocessor design; Remote sensing object detection;
DOI
10.11897/SP.J.1016.2022.02047
Abstract
Recently, with the evolution of space remote sensing technology, the main Earth observation platform has been gradually shifting from single satellites to constellations of light, small satellites. A constellation of several high-resolution satellites collects hundreds of terabytes (TB) of remote sensing image (RSI) data every day, and the traditional satellite-to-ground transmission mechanism can no longer keep pace with processing data at this scale. In-orbit satellites therefore need stronger on-board data processing capabilities to handle increasingly complex observation missions. Meanwhile, in the field of RSI processing, deep learning algorithms based on convolutional neural networks (CNNs) have become the mainstream approach due to their excellent performance. However, their computation-intensive and memory-intensive nature poses many challenges to CNN deployment. Academia and industry have proposed many acceleration methods specific to the CNN domain to cope with its various application scenarios, and numerous FPGA (Field Programmable Gate Array) and ASIC (Application Specific Integrated Circuit) accelerators have been designed for edge and data center settings. Compared with ASICs, FPGAs offer higher flexibility and faster development iteration, which makes them well suited to spaceborne scenarios. In this paper, we propose a microinstruction-driven CNN accelerator for RSI processing on FPGA. The accelerator is co-designed across software and hardware, optimizing microinstruction encoding, instruction-level parallelism (coarse-grained parallelism), and operation-level parallelism (fine-grained parallelism) under the limited memory bandwidth and computing resources available on satellites. At the software level, we propose an extensible microinstruction encoding format and a corresponding compilation method (micro assembler). The microinstruction set covers 14 instructions in 4 types, which schedule the dataflow between the different components of the accelerator. The micro assembler performs graph-level optimization on the CNN topology through convolutional loop tiling and operator fusion, and then generates microinstruction sequences that the accelerator can execute. At the hardware level, we design and implement an RTL (Register Transfer Level) CNN accelerator composed mainly of a micro controller and a logic operator. The micro controller achieves parallel execution of different instruction types through a 5-stage coarse-grained pipeline (data load, data fetch, compute, post-process, write back). The logic operator is a computing array built from cascaded DSP48E1 hard cores that executes convolution operations in parallel through a 32-stage fine-grained pipeline; once the pipeline is full, it completes 32×32 MAC (multiply-accumulate) operations per clock cycle. The performance of the proposed accelerator is evaluated on the Xilinx VX690T FPGA, a chip commonly used on satellites. The design's power consumption is 10.68 W, the runtime maximum throughput (RMT) reaches 378.63 GOP/s, and the MAC efficiency (ME) reaches 91.5%. When the accelerator is used as a coprocessor to accelerate the CNN object detection algorithm YOLOv3-Tiny, the average accuracy on the RSI dataset reaches 0.9 and the detection speed reaches 102 frames/s.
The evaluation results show that our accelerator is 14 times more energy efficient than a typical GPU acceleration method and achieves an improvement of more than 6.9% in ME compared with other FPGA accelerators. © 2022, Science Press. All rights reserved.
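To make the software-level description more concrete, the following is a minimal sketch of what an extensible microinstruction word along these lines could look like. The field names, bit widths, and type/opcode assignments are illustrative assumptions only; the abstract specifies 14 instructions in 4 types, but the concrete layout shown here is not taken from the paper.

```c
/* Hypothetical illustration of an extensible microinstruction word.
 * Field widths, names, and the type/opcode assignments are assumptions
 * for illustration; only "14 instructions in 4 types" comes from the
 * abstract. */
#include <stdint.h>

typedef enum {             /* 4 instruction types (names assumed) */
    ITYPE_LOAD    = 0,     /* move data from external memory to on-chip buffers */
    ITYPE_COMPUTE = 1,     /* drive the 32x32 MAC array                          */
    ITYPE_POST    = 2,     /* post-processing (e.g. activation, pooling)         */
    ITYPE_STORE   = 3      /* write results back to external memory              */
} insn_type_t;

typedef struct {
    uint32_t type   : 2;   /* one of the 4 instruction types                  */
    uint32_t opcode : 4;   /* selects among the 14 concrete instructions      */
    uint32_t dst    : 8;   /* destination buffer / component id (assumed)     */
    uint32_t src    : 8;   /* source buffer / component id (assumed)          */
    uint32_t len    : 10;  /* transfer or loop length in tiles (assumed)      */
} microinsn_t;             /* packs into one 32-bit word on typical compilers */
```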
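Likewise, the convolutional loop tiling performed by the micro assembler can be pictured as a tiled loop nest. The sketch below uses plain C with tile sizes of 32×32 chosen only to match the MAC array width described in the abstract; the loop order, padding assumptions, and all identifiers are illustrative and not the paper's actual schedule.

```c
/* Minimal sketch of a tiled convolution loop nest, assuming "valid"
 * convolution (no padding) and a caller-zeroed output buffer. */
#include <stddef.h>

#define TOC 32  /* output-channel tile: one dimension of the MAC array */
#define TIC 32  /* input-channel tile:  the other dimension            */

void conv_tiled(const float *in,   /* [IC][IH][IW]   */
                const float *w,    /* [OC][IC][K][K] */
                float *out,        /* [OC][OH][OW], zero-initialized */
                int IC, int OC, int IH, int IW, int K)
{
    int OH = IH - K + 1, OW = IW - K + 1;

    for (int oc0 = 0; oc0 < OC; oc0 += TOC)        /* tile over output channels */
      for (int ic0 = 0; ic0 < IC; ic0 += TIC)      /* tile over input channels  */
        for (int oh = 0; oh < OH; ++oh)            /* spatial loops             */
          for (int ow = 0; ow < OW; ++ow)
            for (int kh = 0; kh < K; ++kh)
              for (int kw = 0; kw < K; ++kw)
                /* Innermost 32x32 block: on the accelerator these two loops
                 * would be unrolled onto the cascaded DSP48E1 array, giving
                 * 32x32 MACs per clock cycle once the fine-grained pipeline
                 * is full. */
                for (int oc = oc0; oc < oc0 + TOC && oc < OC; ++oc)
                  for (int ic = ic0; ic < ic0 + TIC && ic < IC; ++ic)
                    out[(oc * OH + oh) * OW + ow] +=
                        in[(ic * IH + oh + kh) * IW + (ow + kw)] *
                        w[((oc * IC + ic) * K + kh) * K + kw];
}
```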
Pages: 2047-2064
Number of pages: 17