Compressing Visual-linguistic Model via Knowledge Distillation

Cited by: 24
Authors
Fang, Zhiyuan [1 ]
Wang, Jianfeng [2 ]
Hu, Xiaowei [2 ]
Wang, Lijuan [2 ]
Yang, Yezhou [1 ]
Liu, Zicheng [2 ]
Affiliations
[1] Arizona State Univ, Tempe, AZ 85287 USA
[2] Microsoft Corp, Redmond, WA 98052 USA
Source
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) | 2021
Funding
US National Science Foundation (NSF)
Keywords
LANGUAGE;
DOI
10.1109/ICCV48922.2021.00146
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104 ; 0812 ; 0835 ; 1405
Abstract
Despite exciting progress in pre-training visual-linguistic (VL) representations, very few works aspire to a small VL model. In this paper, we study knowledge distillation (KD) to effectively compress a large transformer-based VL model into a small VL model. The major challenge arises from the inconsistent regional visual tokens extracted by the different detectors of the Teacher and the Student, which misaligns their hidden representations and attention distributions. To address this problem, we retrain and adapt the Teacher using the same region proposals as the Student's detector, while the features are still extracted by the Teacher's own object detector. With aligned network inputs, the adapted Teacher can transfer knowledge through the intermediate representations. Specifically, we use a mean square error loss to mimic the attention distributions inside the transformer blocks, and present a token-wise noise contrastive loss that aligns the hidden states by contrasting against negative representations stored in a sample queue. We show that the proposed distillation significantly improves the performance of small VL models on image captioning and visual question answering tasks. It reaches a CIDEr score of 120.8 on COCO captioning, an improvement of 5.1 over its non-distilled counterpart, and an accuracy of 69.8 on VQA 2.0, a 0.8 gain over the baseline. Extensive experiments and ablations confirm the effectiveness of VL distillation in both the pre-training and fine-tuning stages.
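The two distillation objectives described in the abstract can be sketched concretely. The snippet below is an illustrative approximation, not the authors' released code: it assumes PyTorch, assumes the Teacher and Student receive the same region proposals so their token sequences align one-to-one, and the function names, queue size, and temperature value are placeholders chosen only for this example.

# Illustrative sketch (assumed PyTorch implementation, not the paper's code) of
# the attention-mimicking MSE loss and the token-wise noise contrastive loss
# with a negative-sample queue described in the abstract.
import torch
import torch.nn.functional as F

def attention_mse_loss(student_attn, teacher_attn):
    # Mean square error between Student and Teacher attention maps.
    # Both tensors are assumed to be (batch, heads, seq_len, seq_len),
    # with matching head counts (a simplifying assumption).
    return F.mse_loss(student_attn, teacher_attn)

def token_nce_loss(student_hidden, teacher_hidden, neg_queue, temperature=0.07):
    # Token-wise noise contrastive loss: each Student token representation is
    # pulled toward the Teacher token at the same position (positive) and
    # pushed away from representations stored in a sample queue (negatives).
    # Shapes: student_hidden, teacher_hidden: (batch, seq_len, dim);
    #         neg_queue: (queue_size, dim).
    s = F.normalize(student_hidden.reshape(-1, student_hidden.size(-1)), dim=-1)
    t = F.normalize(teacher_hidden.reshape(-1, teacher_hidden.size(-1)), dim=-1)
    q = F.normalize(neg_queue, dim=-1)
    pos = (s * t).sum(dim=-1, keepdim=True)   # (N, 1) positive similarities
    neg = s @ q.t()                           # (N, queue_size) negative similarities
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)    # positive sits at index 0

# Toy usage with random tensors standing in for transformer outputs.
B, H, L, D, Q = 2, 12, 50, 768, 1024
loss = attention_mse_loss(torch.rand(B, H, L, L), torch.rand(B, H, L, L)) \
     + token_nce_loss(torch.randn(B, L, D), torch.randn(B, L, D), torch.randn(Q, D))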
Pages: 1408 - 1418
Page count: 11