Multi-layer LSTM Parallel Optimization Based on Hardware and Software Cooperation

Cited: 0
Authors
Chen, Qingfeng [1 ]
Wu, Jing [1 ]
Huang, Feihu [1 ]
Han, Yu [1 ]
Zhao, Qiming [1 ]
Affiliations
[1] Wuhan Univ Sci & Technol, Sch Comp Sci & Technol, Wuhan 430065, Peoples R China
Source
KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, PT II | 2022, vol. 13369
Keywords
LSTM; Software and hardware cooperation; Parallelism; RNN; NLP; COMPRESSION; PREDICTION; SYSTEMS;
DOI
10.1007/978-3-031-10986-7_55
CLC Classification Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
LSTM's gate structure and memory cell make it well suited to time-series problems, and it performs strongly in fields such as machine translation and reasoning. However, LSTM also has shortcomings: its low degree of parallelism limits computing speed. Existing optimization work typically addresses only software or only hardware. Software-side approaches mostly target model accuracy, and CPU-accelerated LSTM implementations do not adapt dynamically to the characteristics of the network. Hardware-side approaches can be tailored to the LSTM model structure, but such customized accelerators are often constrained by that structure and cannot fully exploit the hardware's advantages. This paper proposes a multi-layer LSTM optimization scheme based on software-hardware collaboration. We use row-wise pruning to greatly reduce the number of parameters while preserving accuracy, adapting the model to the parallel structure of the hardware. On the software side, analysis of the multi-layer LSTM module shows that some neurons in different layers can be computed in parallel. We therefore redesign the computation order of the multi-layer LSTM so that the model preserves its own timing dependencies while remaining hardware friendly. Experiments show that our throughput is 10x that of a CPU implementation; compared with other hardware accelerators, throughput improves by 1.2x-1.4x, and latency and resource utilization are also improved.
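The cross-layer parallelism described in the abstract can be illustrated with a minimal sketch. In a stacked LSTM with L layers and T time steps, the cell at (layer l, step t) depends only on (l, t-1) and (l-1, t), so all cells with the same value of l + t are mutually independent and can execute in parallel (a wavefront ordering). The function name below is illustrative, not the authors' actual API; this assumes a standard stacked-LSTM dependency pattern rather than reproducing the paper's exact schedule.

```python
# Hedged sketch: wavefront ordering for a stacked (multi-layer) LSTM.
# Cell (l, t) depends only on (l, t-1) and (l-1, t), so cells sharing the
# same anti-diagonal index d = l + t have no mutual dependencies.

def wavefront_schedule(num_layers: int, num_steps: int):
    """Group LSTM cells (layer, step) into waves; cells within one wave
    are independent and can be computed in parallel on the hardware."""
    waves = []
    for d in range(num_layers + num_steps - 1):
        wave = [(l, d - l)
                for l in range(num_layers)
                if 0 <= d - l < num_steps]
        waves.append(wave)
    return waves

# Example: 3 layers, 4 time steps -> 6 waves instead of 12 serial cells.
for i, wave in enumerate(wavefront_schedule(3, 4)):
    print(f"wave {i}: {wave}")
```

Under this ordering, a naive layer-by-layer execution needing L*T sequential cell evaluations is replaced by L+T-1 waves, which is the kind of reordered computation the abstract says makes the multi-layer model hardware friendly while preserving its timing dependencies.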
Pages: 681-693
Page count: 13