VinVL: Revisiting Visual Representations in Vision-Language Models

Citations: 565
Authors
Zhang, Pengchuan [1 ]
Li, Xiujun [1 ,2 ]
Hu, Xiaowei [1 ]
Yang, Jianwei [1 ]
Zhang, Lei [1 ]
Wang, Lijuan [1 ]
Choi, Yejin [2 ]
Gao, Jianfeng [1 ]
Affiliations
[1] Microsoft Corp, Redmond, WA 98052 USA
[2] Univ Washington, Seattle, WA 98195 USA
Source
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021
DOI
10.1109/CVPR46437.2021.00553
CLC classification
TP18 [Theory of Artificial Intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents a detailed study of improving visual representations for vision-language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model [2], the new model is bigger, better designed for VL tasks, and pretrained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, OSCAR [20], and utilize an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. Code, models and pre-extracted features are released at https://github.com/pzzhang/VinVL.
Pages: 5575-5584 (10 pages)
References (42 items)
[1]   nocaps: novel object captioning at scale [J].
Agrawal, Harsh ;
Desai, Karan ;
Wang, Yufei ;
Chen, Xinlei ;
Jain, Rishabh ;
Johnson, Mark ;
Batra, Dhruv ;
Parikh, Devi ;
Lee, Stefan ;
Anderson, Peter .
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, :8947-8956
[2]   Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering [J].
Anderson, Peter ;
He, Xiaodong ;
Buehler, Chris ;
Teney, Damien ;
Johnson, Mark ;
Gould, Stephen ;
Zhang, Lei .
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, :6077-6086
[3]  
[Anonymous], 2019, Neurips
[4]  
[Anonymous], ECCV
[5]  
Duerig T, 2018, arXiv:1811.00982
[6]  
Faghri Fartash, 2017, arXiv preprint
[7]  
Fang H, 2015, PROC CVPR IEEE, P1473, DOI 10.1109/CVPR.2015.7298754
[8]  
Gan Zhe, 2020, ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS
[9]   Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering [J].
Goyal, Yash ;
Khot, Tejas ;
Summers-Stay, Douglas ;
Batra, Dhruv ;
Parikh, Devi .
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, :6325-6334
[10]  
Hu Xiaowei, 2020, arXiv preprint