Scaling Up Vision-Language Pre-training for Image Captioning

Cited by: 99
Authors
Hu, Xiaowei [1]
Gan, Zhe [1]
Wang, Jianfeng [1]
Yang, Zhengyuan [1]
Liu, Zicheng [1]
Lu, Yumao [1]
Wang, Lijuan [1]
Affiliations
[1] Microsoft, Redmond, WA 98052 USA
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
DOI
10.1109/CVPR52688.2022.01745
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In recent years, we have witnessed a significant performance boost in the image captioning task based on vision-language pre-training (VLP). Scale is believed to be an important factor for this advance. However, most existing work only focuses on pre-training transformers of moderate size (e.g., 12 or 24 layers) on roughly 4 million images. In this paper, we present LEMON, a LargE-scale iMage captiONer, and provide the first empirical study on the scaling behavior of VLP for image captioning. We use the state-of-the-art VinVL model as our reference model, which consists of an image feature extractor and a transformer model, and scale the transformer both up and down, with model sizes ranging from 13 to 675 million parameters. In terms of data, we conduct experiments with up to 200 million image-text pairs, which are automatically collected from the web based on the alt attribute of the image (dubbed ALT200M). Extensive analysis helps to characterize the performance trend as the model size and the pre-training data size increase. We also compare different training recipes, especially for training on large-scale noisy data. As a result, LEMON achieves new state-of-the-art results on several major image captioning benchmarks, including COCO Caption, nocaps, and Conceptual Captions. We also show that LEMON can generate captions with long-tail visual concepts when used in a zero-shot manner.
Pages: 17959-17968
Number of pages: 10