Latency-aware automatic CNN channel pruning with GPU runtime analysis

Cited by: 0
Authors
Liu J. [1 ]
Sun J. [1 ]
Xu Z. [1 ]
Sun G. [1 ]
Affiliations
[1] School of Computer Science and Technology, University of Science and Technology of China, Hefei
Source
BenchCouncil Transactions on Benchmarks, Standards and Evaluations | 2021 / Vol. 1 / Issue 01
Funding
National Natural Science Foundation of China
Keywords
Channel pruning; Convolutional neural network; GPU runtime analysis; Inference latency;
DOI
10.1016/j.tbench.2021.100009
Abstract
The huge storage and computation cost of convolutional neural networks (CNN) makes it challenging to meet real-time inference requirements in many applications. Existing channel pruning methods mainly focus on removing unimportant channels from a CNN model based on rule-of-thumb designs, using the reduction in floating-point operations (FLOPs) and parameter counts to measure pruning quality. The inference latency of the pruned models is often overlooked. In this paper, we propose a latency-aware automatic CNN channel pruning method (LACP), which aims to automatically search for a low-latency and accurate pruned network structure. We evaluate the inaccuracy of measuring pruning quality by FLOPs and the number of parameters, and instead use the model inference latency as the direct optimization metric. To bridge model pruning and inference acceleration, we analyze the inference latency of convolutional layers on GPU. The results show that the inference latency of a convolutional layer exhibits a staircase pattern as the channel number varies, due to the GPU tail effect. Based on that observation, we greatly shrink the search space of network structures. We then apply an evolutionary procedure to search for a computationally efficient pruned network structure, which reduces the inference latency while maintaining model accuracy. Experiments and comparisons with state-of-the-art methods on three image classification datasets show that our method achieves better inference acceleration with less accuracy loss.
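The GPU runtime analysis described above can be reproduced with a simple timing experiment: sweep the output channel count of a single convolutional layer and measure its forward latency. Below is a minimal sketch of such a measurement, not the authors' code; the layer shape, input resolution, and timing loop are illustrative assumptions, and a CUDA-capable GPU with PyTorch is required. On typical hardware the printed latencies form flat plateaus with sudden jumps (the staircase pattern attributed to the GPU tail effect) rather than growing smoothly with the channel count.

```python
# Minimal sketch (assumed setup, not the paper's implementation) of measuring
# conv-layer latency versus output channel count to expose the staircase pattern.
import torch
import torch.nn as nn

def conv_latency_ms(out_channels, in_channels=256, spatial=56, iters=50):
    """Average forward latency (ms) of one 3x3 convolution on the GPU."""
    device = torch.device("cuda")
    conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1).to(device)
    x = torch.randn(1, in_channels, spatial, spatial, device=device)

    # Warm-up so cuDNN autotuning and lazy initialization do not skew the timing.
    for _ in range(10):
        conv(x)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        conv(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

if __name__ == "__main__":
    # Latency typically stays flat over ranges of channel counts and then jumps,
    # which is what motivates restricting the pruning search space.
    for c in range(16, 257, 16):
        print(f"{c:4d} output channels: {conv_latency_ms(c):.3f} ms")
```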