Compressing Visual-linguistic Model via Knowledge Distillation

Cited by: 24
Authors
Fang, Zhiyuan [1 ]
Wang, Jianfeng [2 ]
Hu, Xiaowei [2 ]
Wang, Lijuan [2 ]
Yang, Yezhou [1 ]
Liu, Zicheng [2 ]
Affiliations
[1] Arizona State Univ, Tempe, AZ 85287 USA
[2] Microsoft Corp, Redmond, WA 98052 USA
Source
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) | 2021
Funding
US National Science Foundation (NSF)
Keywords
LANGUAGE;
DOI
10.1109/ICCV48922.2021.00146
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104 ; 0812 ; 0835 ; 1405
Abstract
Despite exciting progress in pre-training visual-linguistic (VL) representations, very few works aspire to a small VL model. In this paper, we study knowledge distillation (KD) to effectively compress a large transformer-based VL model into a small VL model. The major challenge arises from the inconsistent regional visual tokens extracted by the different detectors of the Teacher and the Student, which misaligns their hidden representations and attention distributions. To address this problem, we retrain and adapt the Teacher using the same region proposals as the Student's detector, while the features are still extracted by the Teacher's own object detector. With aligned network inputs, the adapted Teacher can transfer knowledge through the intermediate representations. Specifically, we use a mean square error loss to mimic the attention distributions inside the transformer blocks, and present a token-wise noise contrastive loss that aligns the hidden states by contrasting against negative representations stored in a sample queue. We show that the proposed distillation significantly improves the performance of small VL models on image captioning and visual question answering tasks. It reaches a CIDEr score of 120.8 on COCO captioning, an improvement of 5.1 over its non-distilled counterpart, and an accuracy of 69.8 on VQA 2.0, a 0.8 gain over the baseline. Extensive experiments and ablations confirm the effectiveness of VL distillation in both the pre-training and fine-tuning stages.
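The two distillation objectives described in the abstract can be sketched concretely. The snippet below is an illustrative approximation, not the authors' released code: it assumes PyTorch, assumes the Teacher and Student receive the same region proposals so their token sequences align one-to-one, and the function names, queue size, and temperature value are placeholders chosen only for this example.

# Illustrative sketch (assumed PyTorch implementation, not the paper's code) of
# the attention-mimicking MSE loss and the token-wise noise contrastive loss
# with a negative-sample queue described in the abstract.
import torch
import torch.nn.functional as F

def attention_mse_loss(student_attn, teacher_attn):
    # Mean square error between Student and Teacher attention maps.
    # Both tensors are assumed to be (batch, heads, seq_len, seq_len),
    # with matching head counts (a simplifying assumption).
    return F.mse_loss(student_attn, teacher_attn)

def token_nce_loss(student_hidden, teacher_hidden, neg_queue, temperature=0.07):
    # Token-wise noise contrastive loss: each Student token representation is
    # pulled toward the Teacher token at the same position (positive) and
    # pushed away from representations stored in a sample queue (negatives).
    # Shapes: student_hidden, teacher_hidden: (batch, seq_len, dim);
    #         neg_queue: (queue_size, dim).
    s = F.normalize(student_hidden.reshape(-1, student_hidden.size(-1)), dim=-1)
    t = F.normalize(teacher_hidden.reshape(-1, teacher_hidden.size(-1)), dim=-1)
    q = F.normalize(neg_queue, dim=-1)
    pos = (s * t).sum(dim=-1, keepdim=True)   # (N, 1) positive similarities
    neg = s @ q.t()                           # (N, queue_size) negative similarities
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)    # positive sits at index 0

# Toy usage with random tensors standing in for transformer outputs.
B, H, L, D, Q = 2, 12, 50, 768, 1024
loss = attention_mse_loss(torch.rand(B, H, L, L), torch.rand(B, H, L, L)) \
     + token_nce_loss(torch.randn(B, L, D), torch.randn(B, L, D), torch.randn(Q, D))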
Pages: 1408 - 1418
Page count: 11