Multi-layer LSTM Parallel Optimization Based on Hardware and Software Cooperation

Cited: 0
Authors
Chen, Qingfeng [1 ]
Wu, Jing [1 ]
Huang, Feihu [1 ]
Han, Yu [1 ]
Zhao, Qiming [1 ]
Affiliations
[1] Wuhan Univ Sci & Technol, Sch Comp Sci & Technol, Wuhan 430065, Peoples R China
Source
KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, PT II | 2022, vol. 13369
Keywords
LSTM; Software and hardware cooperation; Parallelism; RNN; NLP; COMPRESSION; PREDICTION; SYSTEMS;
DOI
10.1007/978-3-031-10986-7_55
CLC Classification Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
LSTM's gate structure and memory cell make it well suited to time-series problems, and it performs strongly in fields such as machine translation and reasoning. However, LSTM also has shortcomings: its low degree of parallelism limits computing speed. Existing optimization work typically addresses only software or only hardware. Software-side approaches mostly target model accuracy, and CPU-accelerated LSTM implementations do not adapt dynamically to the characteristics of the network. Hardware-side approaches can be tailored to the LSTM model structure, but such customized accelerators are often constrained by that structure and cannot fully exploit the hardware's advantages. This paper proposes a multi-layer LSTM optimization scheme based on software-hardware collaboration. We use row-wise pruning to greatly reduce the number of parameters while preserving accuracy, adapting the model to the parallel structure of the hardware. On the software side, analysis of the multi-layer LSTM module shows that some neurons in different layers can be computed in parallel. We therefore redesign the computation order of the multi-layer LSTM so that the model preserves its own timing dependencies while remaining hardware friendly. Experiments show that our throughput is 10x that of a CPU implementation; compared with other hardware accelerators, throughput improves by 1.2x-1.4x, and latency and resource utilization are also improved.
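The cross-layer parallelism described in the abstract can be illustrated with a minimal sketch. In a stacked LSTM with L layers and T time steps, the cell at (layer l, step t) depends only on (l, t-1) and (l-1, t), so all cells with the same value of l + t are mutually independent and can execute in parallel (a wavefront ordering). The function name below is illustrative, not the authors' actual API; this assumes a standard stacked-LSTM dependency pattern rather than reproducing the paper's exact schedule.

```python
# Hedged sketch: wavefront ordering for a stacked (multi-layer) LSTM.
# Cell (l, t) depends only on (l, t-1) and (l-1, t), so cells sharing the
# same anti-diagonal index d = l + t have no mutual dependencies.

def wavefront_schedule(num_layers: int, num_steps: int):
    """Group LSTM cells (layer, step) into waves; cells within one wave
    are independent and can be computed in parallel on the hardware."""
    waves = []
    for d in range(num_layers + num_steps - 1):
        wave = [(l, d - l)
                for l in range(num_layers)
                if 0 <= d - l < num_steps]
        waves.append(wave)
    return waves

# Example: 3 layers, 4 time steps -> 6 waves instead of 12 serial cells.
for i, wave in enumerate(wavefront_schedule(3, 4)):
    print(f"wave {i}: {wave}")
```

Under this ordering, a naive layer-by-layer execution needing L*T sequential cell evaluations is replaced by L+T-1 waves, which is the kind of reordered computation the abstract says makes the multi-layer model hardware friendly while preserving its timing dependencies.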
Pages: 681-693
Page count: 13