Rethinking Pruning for Accelerating Deep Inference at the Edge

Cited by: 18
Authors
Gao, Dawei [1 ,2 ]
He, Xiaoxi [3 ]
Zhou, Zimu [4 ]
Tong, Yongxin [1 ,2 ]
Xu, Ke [1 ,2 ]
Thiele, Lothar [3 ]
Affiliations
[1] Beihang Univ, SKLSDE, Beijing, Peoples R China
[2] Beihang Univ, BDBC, Beijing, Peoples R China
[3] Swiss Fed Inst Technol, Zurich, Switzerland
[4] Singapore Management Univ, Singapore, Singapore
Source
KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING | 2020
Funding
US National Science Foundation; Swiss National Science Foundation;
Keywords
Deep Learning; Sequence Labelling; Network Pruning; Automatic Speech Recognition; Named Entity Recognition; Neural Networks;
DOI
10.1145/3394486.3403058
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
There is a growing trend to deploy deep neural networks at the edge for high-accuracy, real-time data mining and user interaction. Applications such as speech recognition and language understanding often apply a deep neural network to encode an input sequence and then use a decoder to generate the output sequence. A promising technique for accelerating these applications on resource-constrained devices is network pruning, which compresses the deep neural network without a severe drop in inference accuracy. However, we observe that although existing network pruning algorithms effectively speed up the encoding network, they dramatically slow down the subsequent decoding and may not always reduce the overall latency of the entire application. To rectify this drawback, we propose entropy-based pruning, a new regularizer that can be seamlessly integrated into existing network pruning algorithms. Our key theoretical insight is that reducing the information entropy of the deep neural network's outputs decreases the upper bound of the subsequent decoding search space. We validate our solution with two state-of-the-art network pruning algorithms on two model architectures. Experimental results show that, compared with existing network pruning algorithms, our entropy-based pruning notably suppresses and even eliminates the increase in decoding time, achieving shorter overall latency with only negligible extra accuracy loss in the applications.
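The abstract's key mechanism invites a brief illustration: the entropy of the network's per-step output distribution is added as an extra term to the objective used while pruning, so the compressed model also produces low-entropy (peaky) outputs that shrink the decoder's search space. The following is a minimal PyTorch-style sketch of that idea, not the authors' implementation; entropy_regularizer, training_step, sparsity_loss_fn, and lam are hypothetical names introduced here for illustration.

    import torch
    import torch.nn.functional as F

    def entropy_regularizer(logits: torch.Tensor) -> torch.Tensor:
        # Mean Shannon entropy of the per-step output distributions.
        # logits: (batch, time, vocab) raw scores from the encoding network.
        log_probs = F.log_softmax(logits, dim=-1)
        probs = log_probs.exp()
        # H(p) = -sum_i p_i * log p_i, averaged over batch and time steps.
        return -(probs * log_probs).sum(dim=-1).mean()

    def training_step(model, inputs, targets, task_loss_fn,
                      sparsity_loss_fn, lam=0.1):
        # Hypothetical combined objective: task loss, plus the pruning
        # method's own sparsity penalty, plus the entropy term weighted
        # by the (assumed) coefficient lam.
        logits = model(inputs)
        return (task_loss_fn(logits, targets)
                + sparsity_loss_fn(model)
                + lam * entropy_regularizer(logits))

Because the entropy term only modifies the loss, it composes with any pruning algorithm that trains or fine-tunes the network, which is consistent with the paper's claim that the regularizer integrates seamlessly into existing pruning methods.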
Pages: 155-164
Page count: 10