VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Cited by: 91

Authors:

Chen, Jun [1]

Guo, Han [2]

Yi, Kai [1]

Li, Boyang [3]

Elhoseiny, Mohamed [1]

Affiliations:

[1] King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

[2] Carnegie Mellon University, Pittsburgh, PA 15213, USA

[3] Nanyang Technological University, Singapore

Source:

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022) | 2022

Funding:

National Research Foundation, Singapore

Keywords:

DOI:

10.1109/CVPR52688.2022.01750

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

The limited availability of annotated data often hinders real-world applications of machine learning. To learn efficiently from small quantities of multimodal data, we leverage the linguistic knowledge of a large pretrained language model (PLM) and quickly adapt it to new domains of image captioning. To use a pretrained model effectively, it is critical to balance the visual input against the prior linguistic knowledge acquired during pretraining. We propose VisualGPT, which employs a novel self-resurrecting encoder-decoder attention mechanism to quickly adapt the PLM with a small amount of in-domain image-text data. The proposed self-resurrecting activation unit produces sparse activations that prevent accidental overwriting of linguistic knowledge. When trained on 0.1%, 0.5% and 1% of the respective training sets, VisualGPT surpasses the best baseline by up to 10.0% CIDEr on MS COCO [43] and 17.9% CIDEr on Conceptual Captions [63]. Furthermore, VisualGPT achieves the state-of-the-art result on IU X-ray [15], a medical report generation dataset. Our code is available at https://github.com/vision-CAIR/VisualGPT.
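The abstract does not spell out the activation unit, so the following is only a minimal sketch of the general idea it describes: two complementary gates (one for visual input, one for linguistic knowledge) that are set to zero when they fall below a threshold, which yields sparse activations. The function name `sparse_complementary_gates`, the threshold `tau`, and its default value are assumptions made for illustration; they are not taken from this record or from the VisualGPT code.

```python
import math
from typing import Tuple


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def sparse_complementary_gates(h: float, tau: float = 0.2) -> Tuple[float, float]:
    """Return (visual_gate, language_gate) for one pre-activation h.

    Hypothetical illustration: the two gates sum to 1 before thresholding,
    and each gate is zeroed when it does not exceed tau (an assumed value).
    """
    s = sigmoid(h)
    visual_gate = s if s > tau else 0.0
    language_gate = (1.0 - s) if (1.0 - s) > tau else 0.0
    return visual_gate, language_gate


if __name__ == "__main__":
    # Print the gate pair for a negative, a neutral, and a positive input.
    for h in (-3.0, 0.0, 3.0):
        print(h, sparse_complementary_gates(h))
```

Under these assumptions, a strongly positive pre-activation keeps only the visual gate, a strongly negative one keeps only the language gate, and values near zero keep both.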

Pages: 18009-18019

Page count: 11

References (83 in total):

[1] nocaps: novel object captioning at scale. Agrawal, Harsh; Desai, Karan; Wang, Yufei; Chen, Xinlei; Jain, Rishabh; Johnson, Mark; Batra, Dhruv; Parikh, Devi; Lee, Stefan; Anderson, Peter. 2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019), 2019: 8947-8956.

[2] Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. Anderson, Peter; He, Xiaodong; Buehler, Chris; Teney, Damien; Johnson, Mark; Gould, Stephen; Zhang, Lei. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018: 6077-6086.

[3] [Anonymous], 2019, CVPR. DOI: 10.1109/CVPR.2019.00850

[4] [Anonymous], 2010, CVPR

[5] [Anonymous], 2017, CVPR. DOI: 10.1109/CVPR.2017.130

[6] [Anonymous], 2019, NEURIPS. DOI: 10.1080/10495398.2019.1653901

[7] [Anonymous], 2017, CVPR. DOI: 10.1109/CVPR.2017.187

[8] [Anonymous], 2018, ECCV. DOI: 10.1007/978-3-030-01264-9_42

[9] [Anonymous], 2011, P 15 C COMP NAT LANG

[10] [Anonymous], 2001, J MACHINE LEARNING R
