VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Cited by: 91
Authors
Chen, Jun [1]
Guo, Han [2]
Yi, Kai [1]
Li, Boyang [3]
Elhoseiny, Mohamed [1]
Affiliations
[1] King Abdullah Univ Sci & Technol KAUST, Thuwal, Saudi Arabia
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[3] Nanyang Technol Univ, Singapore, Singapore
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
Funding
National Research Foundation, Singapore
Keywords
DOI
10.1109/CVPR52688.2022.01750
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The limited availability of annotated data often hinders real-world applications of machine learning. To efficiently learn from small quantities of multimodal data, we leverage the linguistic knowledge from a large pre-trained language model (PLM) and quickly adapt it to new domains of image captioning. To effectively utilize a pretrained model, it is critical to balance the visual input and prior linguistic knowledge from pretraining. We propose VisualGPT, which employs a novel self-resurrecting encoder-decoder attention mechanism to quickly adapt the PLM with a small amount of in-domain image-text data. The proposed self-resurrecting activation unit produces sparse activations that prevent accidental overwriting of linguistic knowledge. When trained on 0.1%, 0.5% and 1% of the respective training sets, VisualGPT surpasses the best baseline by up to 10.0% CIDEr on MS COCO [43] and 17.9% CIDEr on Conceptual Captions [63]. Furthermore, VisualGPT achieves the state-of-the-art result on IU X-ray [15], a medical report generation dataset. Our code is available at https://github.com/vision-CAIR/VisualGPT.
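The abstract does not state the formula for the self-resurrecting activation unit, so the following is only a minimal sketch of one way a pair of gates could balance visual and linguistic signals while producing sparse activations. It assumes two complementary sigmoid gates that are each zeroed when they fall at or below a threshold. The function name `thresholded_gates`, the parameter `tau`, and the thresholding rule itself are illustrative assumptions, not taken from this record.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def thresholded_gates(h, tau=0.2):
    """Return (visual_gate, language_gate) for a list of pre-activations h.

    ASSUMPTION: the visual gate is sigmoid(h) and the language gate is
    1 - sigmoid(h); each gate is set to zero when its value is <= tau.
    This is a hypothetical illustration of sparse, complementary gating.
    """
    visual, language = [], []
    for x in h:
        s = sigmoid(x)
        visual.append(s if s > tau else 0.0)
        language.append((1.0 - s) if (1.0 - s) > tau else 0.0)
    return visual, language


# Strongly negative input silences the visual gate, strongly positive
# input silences the language gate, and a neutral input keeps both.
vis, lan = thresholded_gates([-3.0, 0.0, 3.0], tau=0.2)
```

Under this assumed rule, the gate that has been driven toward zero is cut off entirely rather than left as a small nonzero weight, which is one way to read the abstract's claim that sparse activations protect the pretrained linguistic knowledge.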
Pages: 18009-18019
Page count: 11
Related papers
83 in total
[1] nocaps: novel object captioning at scale [J].
Agrawal, Harsh; Desai, Karan; Wang, Yufei; Chen, Xinlei; Jain, Rishabh; Johnson, Mark; Batra, Dhruv; Parikh, Devi; Lee, Stefan; Anderson, Peter.
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019: 8947-8956
[2] Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering [J].
Anderson, Peter; He, Xiaodong; Buehler, Chris; Teney, Damien; Johnson, Mark; Gould, Stephen; Zhang, Lei.
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018: 6077-6086
[3] [Anonymous], 2019, CVPR, DOI 10.1109/CVPR.2019.00850
[4] [Anonymous], 2010, CVPR
[5] [Anonymous], 2017, CVPR, DOI 10.1109/CVPR.2017.130
[6] [Anonymous], 2019, NEURIPS, DOI 10.1080/10495398.2019.1653901
[7] [Anonymous], 2017, CVPR, DOI 10.1109/CVPR.2017.187
[8] [Anonymous], 2018, ECCV, DOI 10.1007/978-3-030-01264-9_42
[9] [Anonymous], 2011, P 15 C COMP NAT LANG
[10] [Anonymous], 2001, J MACHINE LEARNING R