Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity

Cited by: 127
Authors
Cao, Shijie [1 ]
Zhang, Chen [2 ]
Yao, Zhuliang [3 ]
Xiao, Wencong [4 ]
Nie, Lanshun [1 ]
Zhan, Dechen [1 ]
Liu, Yunxin [2 ]
Wu, Ming [2 ]
Zhang, Lintao [2 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Peoples R China
[2] Microsoft Res, Redmond, WA USA
[3] Tsinghua Univ, Beijing, Peoples R China
[4] Beihang Univ, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 2019 ACM/SIGDA INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE GATE ARRAYS (FPGA'19) | 2019
Keywords
FPGA; Deep Neural Networks; LSTM; Weight Pruning; Inference; Bank-Balanced Sparsity;
DOI
10.1145/3289602.3293898
CLC number (Chinese Library Classification)
TP301 [Theory, Methods];
Discipline code
081202;
Abstract
Neural networks based on Long Short-Term Memory (LSTM) are widely deployed in latency-sensitive language and speech applications. To speed up LSTM inference, previous research proposes weight pruning techniques to reduce computational cost. Unfortunately, irregular computation and memory accesses in unrestricted sparse LSTM limit the realizable parallelism, especially when implemented on FPGA. To address this issue, some researchers propose block-based sparsity patterns to increase the regularity of sparse weight matrices, but these approaches suffer from deteriorated prediction accuracy. This work presents Bank-Balanced Sparsity (BBS), a novel sparsity pattern that can maintain model accuracy at a high sparsity level while still enabling an efficient FPGA implementation. BBS partitions each weight matrix row into banks for parallel computing, while adopting fine-grained pruning inside each bank to maintain model accuracy. We develop a 3-step software-hardware co-optimization approach to apply BBS in real FPGA hardware. First, we propose a bank-balanced pruning method to induce the BBS pattern on weight matrices. Then we introduce a decoding-free sparse matrix format, Compressed Sparse Banks (CSB), that transparently exposes inter-bank parallelism in BBS to hardware. Finally, we design an FPGA accelerator that takes advantage of BBS to eliminate irregular computation and memory accesses. Implemented on an Intel Arria-10 FPGA, the BBS accelerator achieves 750.9 GOPS on sparse LSTM networks with a batch size of 1. Compared to state-of-the-art FPGA accelerators for LSTM with different compression techniques, the BBS accelerator achieves a 2.3x to 3.7x improvement in energy efficiency and a 7.0x to 34.4x reduction in latency with negligible loss of model accuracy.
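The abstract describes bank-balanced pruning only at a high level: each weight-matrix row is split into equal-sized banks, and the smallest-magnitude weights are pruned independently within each bank so that every bank keeps the same number of nonzeros. The Python sketch below is an illustrative reconstruction of that pattern under those stated assumptions, not the authors' implementation; the function name `bank_balanced_prune`, the use of `np.split`, and the `sparsity` parameter are introduced here purely for illustration.

```python
import numpy as np

def bank_balanced_prune(row, num_banks, sparsity):
    """Prune one weight-matrix row so every bank keeps the same number
    of largest-magnitude weights (illustrative sketch of the BBS pattern)."""
    banks = np.split(row, num_banks)                      # equal-sized banks
    keep = int(round(len(banks[0]) * (1.0 - sparsity)))   # nonzeros kept per bank
    pruned = []
    for bank in banks:
        mask = np.zeros_like(bank)
        top = np.argsort(np.abs(bank))[-keep:]            # largest |w| inside this bank
        mask[top] = 1.0
        pruned.append(bank * mask)
    return np.concatenate(pruned)

# Example: a 16-element row, 4 banks, 75% sparsity -> exactly 1 nonzero per bank,
# which is what keeps the per-bank workload balanced for parallel hardware.
row = np.random.randn(16)
print(bank_balanced_prune(row, num_banks=4, sparsity=0.75))
```

Because every bank ends up with the same nonzero count, the hardware can assign one multiplier per bank without load imbalance, which is the regularity property the abstract credits for eliminating irregular computation and memory accesses.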
Pages: 63-72
Page count: 10