Compressing Visual-linguistic Model via Knowledge Distillation

Cited by: 29

Authors:

Fang, Zhiyuan [1]

Wang, Jianfeng [2]

Hu, Xiaowei [2]

Wang, Lijuan [2]

Yang, Yezhou [1]

Liu, Zicheng [2]

Affiliations:

[1] Arizona State Univ, Tempe, AZ 85287 USA

[2] Microsoft Corp, Redmond, WA 98052 USA

Source:

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) | 2021

Funding:

U.S. National Science Foundation;

Keywords:

LANGUAGE;

DOI:

10.1109/ICCV48922.2021.00146

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Despite exciting progress in pre-training for visual-linguistic (VL) representations, very few aspire to a small VL model. In this paper, we study knowledge distillation (KD) to effectively compress a large transformer-based VL model into a small VL model. The major challenge arises from the inconsistent regional visual tokens extracted from the different detectors of Teacher and Student, resulting in the misalignment of hidden representations and attention distributions. To address the problem, we retrain and adapt the Teacher by using the same region proposals from Student's detector while the features are from Teacher's own object detector. With aligned network inputs, the adapted Teacher is capable of transferring the knowledge through the intermediate representations. Specifically, we use the mean square error loss to mimic the attention distribution inside the transformer block, and present a token-wise noise contrastive loss to align the hidden state by contrasting with negative representations stored in a sample queue. To this end, we show that our proposed distillation significantly improves the performance of small VL models on image captioning and visual question answering tasks. It reaches 120.8 in CIDEr score on COCO captioning, an improvement of 5.1 over its non-distilled counterpart; and an accuracy of 69.8 on VQA 2.0, a 0.8 gain from the baseline. Our extensive experiments and ablations confirm the effectiveness of VL distillation in both pre-training and fine-tuning stages.
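The two distillation objectives named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the temperature value, and the InfoNCE-style form of the contrastive term are all assumptions made for illustration, and the inputs are assumed to be non-zero vectors.

```python
import math

def attention_mse(teacher_attn, student_attn):
    """Mean squared error between two attention maps given as lists of rows."""
    diffs = [
        (t - s) ** 2
        for t_row, s_row in zip(teacher_attn, student_attn)
        for t, s in zip(t_row, s_row)
    ]
    return sum(diffs) / len(diffs)

def _cosine(u, v):
    # Cosine similarity; assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def token_contrastive_loss(student_tokens, teacher_tokens, queue, temperature=0.1):
    """InfoNCE-style loss (an assumed form): each student token should match
    the teacher token at the same position and differ from queued negatives."""
    losses = []
    for s, t in zip(student_tokens, teacher_tokens):
        pos = _cosine(s, t) / temperature
        logits = [pos] + [_cosine(s, n) / temperature for n in queue]
        # Numerically stable log-sum-exp over the positive and all negatives.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - pos)
    return sum(losses) / len(losses)
```

In this sketch the queue is a plain list of negative token vectors; how the queue is filled and refreshed during training is not specified in the abstract and is left out here.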

Pages: 1408-1418

Page count: 11
