CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising

Cited by: 29
Authors
Luo, Jianjie [1 ,2 ]
Li, Yehao [3 ]
Pan, Yingwei [3 ]
Yao, Ting [3 ]
Chao, Hongyang [1 ,2 ]
Mei, Tao [3 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] Sun Yat Sen Univ, Minist Educ, Key Lab Machine Intelligence & Adv Comp, Guangzhou, Peoples R China
[3] JD AI Res, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021 | 2021
Funding
National Key R&D Program of China
Keywords
Vision-language pre-training; Video understanding; Contrastive learning; Video captioning; Cross-modal retrieval
DOI
10.1145/3474085.3475703
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
BERT-type structures have led to a revolution in vision-language pre-training and to state-of-the-art results on numerous vision-language downstream tasks. Existing solutions predominantly capitalize on multi-modal inputs with mask tokens to trigger mask-based proxy pre-training tasks (e.g., masked language modeling and masked object/frame prediction). In this work, we argue that such masked inputs inevitably introduce noise into the cross-modal matching proxy task, and thus leave the inherent vision-language association under-explored. As an alternative, we derive a particular form of cross-modal proxy objective for video-language pre-training, i.e., Contrastive Cross-modal matching and denoising (CoCo). By viewing the masked frame/word sequences as noisy augmentations of the primary unmasked ones, CoCo strengthens the video-language association by simultaneously pursuing inter-modal matching and intra-modal denoising between masked and unmasked inputs in a contrastive manner. Our CoCo proxy objective can further be integrated into any BERT-type encoder-decoder structure for video-language pre-training, named Contrastive Cross-modal BERT (CoCo-BERT). We pre-train CoCo-BERT on the TV dataset and a newly collected large-scale GIF video dataset (ACTION). Through extensive experiments over a wide range of downstream tasks (e.g., cross-modal retrieval, video question answering, and video captioning), we demonstrate the superiority of CoCo-BERT as a pre-trained structure.
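The objective described above can be illustrated with a minimal sketch: an InfoNCE-style contrastive loss where each unmasked video feature is matched against its paired sentence feature (inter-modal matching), and each masked feature is pulled toward its own unmasked counterpart (intra-modal denoising). The function names (`info_nce`, `coco_loss`), the pooled-feature inputs, the temperature, and the equal weighting of the two terms are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE loss: each anchor row should match the positive row with the
    same index, against all other rows in the batch as in-batch negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # true pairs on the diagonal

def coco_loss(video, sent, video_masked, sent_masked, temperature=0.07):
    """Sketch of a CoCo-style objective (assumed equal weighting):
    - matching: contrast unmasked video features against paired sentence features;
    - denoising: treat masked features as noisy augmentations and contrast them
      against their own unmasked versions."""
    matching = (info_nce(video, sent, temperature)
                + info_nce(sent, video, temperature))
    denoising = (info_nce(video_masked, video, temperature)
                 + info_nce(sent_masked, sent, temperature))
    return matching + denoising
```

A well-trained encoder should make the loss small when video and sentence features of the same pair are aligned, and large when the pairing is shuffled, which is what drives the masked/unmasked features toward a shared cross-modal space.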
Pages: 5600-5608
Page count: 9