CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising

Cited by: 29
Authors
Luo, Jianjie [1 ,2 ]
Li, Yehao [3 ]
Pan, Yingwei [3 ]
Yao, Ting [3 ]
Chao, Hongyang [1 ,2 ]
Mei, Tao [3 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] Sun Yat Sen Univ, Minist Educ, Key Lab Machine Intelligence & Adv Comp, Guangzhou, Peoples R China
[3] JD AI Res, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021 | 2021
Funding
National Key R&D Program of China
Keywords
Vision-language pre-training; Video understanding; Contrastive learning; Video captioning; Cross-modal retrieval
DOI
10.1145/3474085.3475703
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
BERT-type structures have led to a revolution in vision-language pre-training and to state-of-the-art results on numerous vision-language downstream tasks. Existing solutions predominantly capitalize on multi-modal inputs with mask tokens to trigger mask-based proxy pre-training tasks (e.g., masked language modeling and masked object/frame prediction). In this work, we argue that such masked inputs inevitably introduce noise into the cross-modal matching proxy task, and thus leave the inherent vision-language association under-explored. As an alternative, we derive a particular form of cross-modal proxy objective for video-language pre-training, i.e., Contrastive Cross-modal matching and denoising (CoCo). By viewing the masked frame/word sequences as noisy augmentations of the primary unmasked ones, CoCo strengthens the video-language association by simultaneously pursuing inter-modal matching and intra-modal denoising between masked and unmasked inputs in a contrastive manner. Our CoCo proxy objective can further be integrated into any BERT-type encoder-decoder structure for video-language pre-training, named Contrastive Cross-modal BERT (CoCo-BERT). We pre-train CoCo-BERT on the TV dataset and a newly collected large-scale GIF video dataset (ACTION). Through extensive experiments over a wide range of downstream tasks (e.g., cross-modal retrieval, video question answering, and video captioning), we demonstrate the superiority of CoCo-BERT as a pre-trained structure.
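The objective described above can be illustrated with a minimal sketch: an InfoNCE-style contrastive loss where each unmasked video feature is matched against its paired sentence feature (inter-modal matching), and each masked feature is pulled toward its own unmasked counterpart (intra-modal denoising). The function names (`info_nce`, `coco_loss`), the pooled-feature inputs, the temperature, and the equal weighting of the two terms are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE loss: each anchor row should match the positive row with the
    same index, against all other rows in the batch as in-batch negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # true pairs on the diagonal

def coco_loss(video, sent, video_masked, sent_masked, temperature=0.07):
    """Sketch of a CoCo-style objective (assumed equal weighting):
    - matching: contrast unmasked video features against paired sentence features;
    - denoising: treat masked features as noisy augmentations and contrast them
      against their own unmasked versions."""
    matching = (info_nce(video, sent, temperature)
                + info_nce(sent, video, temperature))
    denoising = (info_nce(video_masked, video, temperature)
                 + info_nce(sent_masked, sent, temperature))
    return matching + denoising
```

A well-trained encoder should make the loss small when video and sentence features of the same pair are aligned, and large when the pairing is shuffled, which is what drives the masked/unmasked features toward a shared cross-modal space.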
Pages: 5600-5608
Page count: 9