Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

Cited by: 738
Authors
Li, Xiujun [1 ,2 ]
Yin, Xi [1 ]
Li, Chunyuan [1 ]
Zhang, Pengchuan [1 ]
Hu, Xiaowei [1 ]
Zhang, Lei [1 ]
Wang, Lijuan [1 ]
Hu, Houdong [1 ]
Dong, Li [1 ]
Wei, Furu [1 ]
Choi, Yejin [2 ]
Gao, Jianfeng [1 ]
Affiliations
[1] Microsoft Corp, Redmond, WA 98052 USA
[2] Univ Washington, Seattle, WA 98195 USA
Source
COMPUTER VISION - ECCV 2020, PT XXX | 2020, Vol. 12375
Keywords
Object semantics; Vision-and-language; Pre-training
DOI
10.1007/978-3-030-58577-8_8
Chinese Library Classification (CLC) Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Large-scale pre-training methods that learn cross-modal representations on image-text pairs are becoming popular for vision-language tasks. Existing methods simply concatenate image region features and text features as input to the model to be pre-trained, and use self-attention to learn image-text semantic alignments in a brute-force manner. In this paper, we propose a new learning method, Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of alignments. Our method is motivated by the observation that the salient objects in an image can be accurately detected and are often mentioned in the paired text. We pre-train an Oscar model on a public corpus of 6.5 million text-image pairs and fine-tune it on downstream tasks, creating new state-of-the-art results on six well-established vision-language understanding and generation tasks. (The code and pre-trained models are released at https://github.com/microsoft/Oscar.)
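
The abstract's key idea, using detected object tags as anchors shared between the text and image modalities, reduces to a simple input construction. Below is a minimal, hypothetical PyTorch sketch (class and variable names are ours, not from the released repository; the dimensions follow the BERT-base and Faster R-CNN conventions the paper builds on): word tokens and object tags share one text embedding table, region features are linearly projected into the same hidden space, and the three segments are concatenated into a single transformer input sequence.

import torch
import torch.nn as nn

class OscarInputSketch(nn.Module):
    # Hypothetical illustration of Oscar's (words, tags, regions) triple.
    # The actual implementation is at https://github.com/microsoft/Oscar.
    def __init__(self, vocab_size=30522, hidden=768, region_dim=2048 + 6):
        super().__init__()
        # Object tags are words, so they reuse the word embedding table;
        # this shared embedding is what lets tags act as anchor points.
        self.text_embed = nn.Embedding(vocab_size, hidden)
        # Region features (e.g., a 2048-d Faster R-CNN vector plus 6-d box
        # geometry) are projected into the same hidden space.
        self.region_proj = nn.Linear(region_dim, hidden)

    def forward(self, word_ids, tag_ids, region_feats):
        w = self.text_embed(word_ids)        # (B, L_w, hidden) text view
        q = self.text_embed(tag_ids)         # (B, L_q, hidden) tag anchors
        v = self.region_proj(region_feats)   # (B, L_v, hidden) image view
        # One sequence for the transformer: self-attention can now align
        # regions and words through the shared tag embeddings.
        return torch.cat([w, q, v], dim=1)

In the paper, the concatenated sequence is fed to a multi-layer transformer pre-trained with a masked-token loss over the text side and a contrastive loss over (tags, regions) pairs.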
Pages: 121-137
Page count: 17