Distilling from professors: Enhancing the knowledge distillation of teachers

Cited by: 16
Authors
Bang, Duhyeon [1]
Lee, Jongwuk [2]
Shim, Hyunjung [3]
Affiliations
[1] SK Telecom, SK T Tower,65 Eulji Ro, Seoul, South Korea
[2] Sungkyunkwan Univ, Dept Software, 2066 Seobu Ro, Suwon, Gyeonggi Do, South Korea
[3] Yonsei Univ, Sch Integrated Technol, 85 Songdogwakak Ro, Incheon, South Korea
Keywords
Knowledge distillation; Professor model; Conditional adversarial autoencoder
DOI
10.1016/j.ins.2021.08.020
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Knowledge distillation (KD) is a successful technique for transferring knowledge from one machine learning model to another. The idea of KD has been widely used for tasks such as model compression and knowledge transfer between different models. However, existing studies on KD have overlooked the possibility that the dark knowledge (i.e., soft targets) obtained from a complex and large model (a.k.a. a teacher model) may be incorrect or insufficient. Such knowledge can hinder the effective learning of a smaller model (a.k.a. a student model). In this paper, we propose the professor model, which refines the soft targets from the teacher model to improve KD. The professor model aims to achieve two goals: 1) improving the prediction accuracy and 2) capturing the inter-class correlation of the soft targets from the teacher model. We first design the professor model by reformulating a conditional adversarial autoencoder (CAAE). Then, we devise two KD strategies that use both the teacher and professor models. Our empirical study demonstrates that the professor model effectively improves KD on three benchmark datasets: CIFAR100, Tiny ImageNet, and ILSVRC2015. Moreover, our comprehensive analysis shows that the professor model is much more effective than employing a stronger teacher model whose parameter count exceeds the sum of the teacher's and professor's parameters. Since the professor model is model-agnostic, it can be combined with any KD algorithm and consistently improves various KD techniques. (c) 2021 Elsevier Inc. All rights reserved.
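The abstract describes the standard soft-target KD setup into which the professor model is inserted. The sketch below is a minimal, hypothetical illustration of that setup in PyTorch: a Hinton-style KD loss in which the student is trained against refined soft targets rather than the raw teacher distribution. The `professor` module, its interface, and the hyperparameters `T` and `alpha` are assumptions made for illustration; the paper's CAAE-based refinement and its two KD strategies are not reproduced here.

```python
import torch.nn.functional as F

def kd_loss(student_logits, refined_soft_targets, labels, T=4.0, alpha=0.5):
    """Hinton-style soft-target KD loss, fed with professor-refined
    soft targets in place of the raw teacher distribution."""
    # Hard-target term: ordinary cross-entropy with the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-target term: KL divergence between the temperature-softened
    # student distribution and the refined soft targets, scaled by T^2.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        refined_soft_targets,
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1.0 - alpha) * kl

# Hypothetical usage (interfaces assumed, not taken from the paper):
#   teacher_logits = teacher(images)                  # frozen teacher
#   soft = F.softmax(teacher_logits / 4.0, dim=1)     # dark knowledge
#   refined = professor(soft, labels)                 # refinement module
#   loss = kd_loss(student(images), refined, labels)
```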
Pages: 743-755
Page count: 13