VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Cited by: 96
Authors
Chen, Jun [1]
Guo, Han [2]
Yi, Kai [1]
Li, Boyang [3]
Elhoseiny, Mohamed [1]
Affiliations
[1] King Abdullah Univ Sci & Technol KAUST, Thuwal, Saudi Arabia
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[3] Nanyang Technol Univ, Singapore, Singapore
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
Funding
National Research Foundation, Singapore
DOI
10.1109/CVPR52688.2022.01750
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The limited availability of annotated data often hinders real-world applications of machine learning. To efficiently learn from small quantities of multimodal data, we leverage the linguistic knowledge from a large pre-trained language model (PLM) and quickly adapt it to new domains of image captioning. To effectively utilize a pretrained model, it is critical to balance the visual input and prior linguistic knowledge from pretraining. We propose VisualGPT, which employs a novel self-resurrecting encoder-decoder attention mechanism to quickly adapt the PLM with a small amount of in-domain image-text data. The proposed self-resurrecting activation unit produces sparse activations that prevent accidental overwriting of linguistic knowledge. When trained on 0.1%, 0.5% and 1% of the respective training sets, VisualGPT surpasses the best baseline by up to 10.0% CIDEr on MS COCO [43] and 17.9% CIDEr on Conceptual Captions [63]. Furthermore, VisualGPT achieves the state-of-the-art result on IU X-ray [15], a medical report generation dataset. Our code is available at https://github.com/vision-CAIR/VisualGPT.
Pages: 18009-18019
Page count: 11
References
83 entries in total
[11]
[Anonymous], 2019, CVPR, DOI 10.1109/CVPR.2019.00138
[12]
[Anonymous], 2017, CVPR, DOI 10.1109/CVPR.2017.681
[13]
[Anonymous], 2018, ECCV, DOI 10.1007/978-3-030-04070-39
[14]
[Anonymous], 2018, NEURIPS
[15]
[Anonymous], 2018, ECCV, DOI 10.1007/978-3-030-01246-5_31
[16]
[Anonymous], 2019, CVPR, DOI 10.1109/CVPR.2019.01094
[17]  
Ba J. L., 2016, Advances in Neural Information Processing Systems (NeurIPS), P1
[18]   Embedding a cluster-based overlay mesh in mobile ad hoc networks without cluster heads [J].
Banerjee, A;
King, CT;
Hsiao, HC.
2005 International Conference on Parallel Processing, Proceedings, 2005: 49-56
[19]  
Brown TB, 2020, ADV NEUR IN, V33
[20]  
Chen L. C., 2017, RETHINKING ATROUS CO