The Progress and Trends of FPGA-Based Accelerators in Deep Learning

Cited by: 0
Authors
Wu Y.-X. [1 ]
Liang K. [1 ,2 ]
Liu Y. [2 ]
Cui H.-M. [2 ]
Affiliations
[1] College of Computer Science and Technology, Harbin Engineering University, Harbin
[2] State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
Source
Jisuanji Xuebao/Chinese Journal of Computers | 2019 / Vol. 42 / No. 11
Keywords
CPU-FPGA; Deep learning; FPGA; Hardware accelerator; Neural network;
DOI
10.11897/SP.J.1016.2019.02461
Abstract
With the coming of the big data era, deep learning plays a critical role in extracting meaningful information from massive data, and it has been widely applied in domains such as computer vision, speech recognition, and natural language processing. Deep learning algorithms have a large number of parameters, and their computation is dominated by matrix multiplications and multiply-accumulate operations under a simple computational model. At the same time, research and industry demand ever-higher accuracy, and the increasingly complex models that win image classification and object detection contests carry more and more weights, so speeding up the inference and training of deep learning has become much more important. This paper reviews one of the approaches: accelerating deep learning on FPGAs. Firstly, it introduces deep learning algorithms and their characteristics, especially the convolutional neural network and the recurrent neural network; why the FPGA approach fits this problem; what benefits can be gained compared with a CPU-only or GPU solution; and what must be done to enable it. It then analyzes the challenges of accelerating deep learning on FPGAs. The CPU-FPGA combination is the most popular architecture at present, but there are different ways to set up a usable platform, and for CPU-FPGA platforms data communication is one of the important factors affecting acceleration performance. Based on this, the paper introduces platform construction from the aspects of SoC FPGAs and standard FPGAs, and comparatively analyzes how the CPU-FPGA data communication of the two approaches differs. For different engineers and developers, the development environments and tools for accelerating deep learning on FPGAs are presented from the aspects of high-level languages, such as C and OpenCL, and hardware description languages, such as Verilog.
Using a hardware description language, it is not hard to apply techniques that require many complex low-level hardware control operations to improve the performance of FPGA implementations, but doing so always requires a working knowledge of digital design and circuits. In contrast to hardware description languages, high-level languages give engineers and developers more freedom to explore algorithms instead of designing hardware architectures, which suits researchers who need to iterate quickly through design cycles and software developers who have no background in digital design. On this basis, FPGA-based accelerators for deep learning algorithms are reviewed in detail in terms of hardware architecture, design methods, and optimization strategies. Three typical hardware architectures are presented for convolutional neural networks, and the hardware architectures of two typical models are introduced for recurrent neural networks. Design methods are reviewed in terms of the accelerator's goal, the way the algorithm is modeled, and how the best solution is found. Optimization strategies are presented in terms of processor and memory communication, which are critical factors for improving accelerator performance. Finally, the paper discusses future directions for research on FPGA-accelerated deep learning algorithms from the aspects of development environments, communication between CPU and FPGA, better model compression methods, and cloud applications with FPGAs. © 2019, Science Press. All rights reserved.
Pages: 2461-2480
Page count: 19