Knowledge from the original network: restore a better pruned network with knowledge distillation

Cited by: 28
Authors
Chen, Liyang [1 ]
Chen, Yongquan [3 ]
Xi, Juntong [1 ]
Le, Xinyi [2 ,3 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Sch Mech Engn, Shanghai, Peoples R China
[2] Shanghai Jiao Tong Univ, Dept Automat, Shanghai, Peoples R China
[3] Shenzhen Inst Artificial Intelligence & Robot Soc, Shenzhen, Peoples R China
Keywords
Model compression; Network pruning; Knowledge distillation; Deep neural networks;
DOI
10.1007/s40747-020-00248-y
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
To deploy deep neural networks on edge devices with limited computation and storage budgets, model compression is necessary for practical applications of deep learning. Pruning, a traditional model compression technique, reduces the number of parameters in the model weights. However, when a deep neural network is pruned, its accuracy drops significantly. The traditional way to mitigate this accuracy loss is fine-tuning, but when too many parameters are pruned, the pruned network's capacity is reduced so heavily that fine-tuning cannot recover high accuracy. In this paper, we apply the knowledge distillation strategy to abate the accuracy loss of pruned models. The original network of the pruned network is used as the teacher network, aiming to transfer the dark knowledge from the original network to the pruned sub-network. We apply three mainstream knowledge distillation methods: response-based knowledge, feature-based knowledge, and relation-based knowledge (Gou et al., Knowledge distillation: a survey, 2020), and compare the results to the traditional fine-tuning method with ground-truth labels. Experiments were conducted on the CIFAR100 dataset with several deep convolutional neural networks. Results show that a pruned network recovered by knowledge distillation from its original network achieves higher accuracy than one recovered by fine-tuning with sample labels. It is also validated in this paper that the original network performs better as the teacher than differently structured networks with the same accuracy.
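The response-based distillation mentioned in the abstract can be sketched as the classic Hinton-style soft-target loss, where the original (unpruned) network's temperature-softened outputs supervise the pruned student. This is a minimal NumPy illustration, not the paper's implementation; the temperature `T` and mixing weight `alpha` are illustrative hyperparameters, not values from the paper.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; a higher T softens the distribution,
    # exposing the teacher's "dark knowledge" about non-target classes.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Response-based KD loss: a weighted sum of
    (1) cross-entropy between softened teacher and student outputs, and
    (2) standard cross-entropy with the ground-truth labels."""
    p_teacher = softmax(teacher_logits, T)          # soft targets from the original network
    log_p_student = np.log(softmax(student_logits, T))
    # Scale the soft-target term by T^2 so its gradient magnitude stays
    # comparable to the hard-label term across temperatures.
    kd_term = -(p_teacher * log_p_student).sum(axis=-1).mean() * T * T
    log_p_hard = np.log(softmax(student_logits, 1.0))
    labels = np.asarray(labels)
    ce_term = -log_p_hard[np.arange(len(labels)), labels].mean()
    return alpha * kd_term + (1 - alpha) * ce_term
```

Fine-tuning with sample labels corresponds to `alpha = 0`; pure distillation from the original network corresponds to `alpha = 1`.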
Pages: 709-718
Page count: 10
References (36 in total)
[1] Belkin, Mikhail; Hsu, Daniel; Ma, Siyuan; Mandal, Soumik. Reconciling modern machine-learning practice and the classical bias-variance trade-off. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2019, 116(32):15849-15854.
[2] Chen Hanting, 2020, IEEE T NEURAL NETWOR.
[3] Denil Misha, 2013, Advances in Neural Information Processing Systems, V26. DOI 10.5555/2999792.2999852.
[4] Frankle Jonathan, 2018, Tech. Rep.
[5] Furlanello T, 2018, PR MACH LEARN RES, V80.
[6] Gou, Jianping; Yu, Baosheng; Maybank, Stephen J.; Tao, Dacheng. Knowledge Distillation: A Survey. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2021, 129(06):1789-1819.
[7] Graf, 2017, INT C LEARN REPR ICC.
[8] Guo YW, 2016, ADV NEUR IN, V29.
[9] HAGIWARA M, 1993, IEEE IJCNN, P351.
[10] Han S, 2015, ADV NEUR IN, V28.