CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising*

Cited by: 29
Authors
Luo, Jianjie [1 ,2 ]
Li, Yehao [3 ]
Pan, Yingwei [3 ]
Yao, Ting [3 ]
Chao, Hongyang [1 ,2 ]
Mei, Tao [3 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] Sun Yat Sen Univ, Minist Educ, Key Lab Machine Intelligence & Adv Comp, Guangzhou, Peoples R China
[3] JD AI Res, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021 | 2021
Funding
National Key R&D Program of China
Keywords
Vision-language pre-training; Video understanding; Contrastive learning; Video captioning; Cross-modal retrieval
DOI
10.1145/3474085.3475703
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405
Abstract
BERT-type structures have led to a revolution in vision-language pre-training and to state-of-the-art results on numerous vision-language downstream tasks. Existing solutions predominantly capitalize on multi-modal inputs with mask tokens to trigger mask-based proxy pre-training tasks (e.g., masked language modeling and masked object/frame prediction). In this work, we argue that such masked inputs inevitably introduce noise into the cross-modal matching proxy task, and thus leave the inherent vision-language association under-explored. As an alternative, we derive a particular form of cross-modal proxy objective for video-language pre-training, i.e., Contrastive Cross-modal matching and denoising (CoCo). By viewing the masked frame/word sequences as noisy augmentations of the primary unmasked ones, CoCo strengthens the video-language association by simultaneously pursuing inter-modal matching and intra-modal denoising between masked and unmasked inputs in a contrastive manner. Our CoCo proxy objective can further be integrated into any BERT-type encoder-decoder structure for video-language pre-training, named Contrastive Cross-modal BERT (CoCo-BERT). We pre-train CoCo-BERT on the TV dataset and a newly collected large-scale GIF video dataset (ACTION). Through extensive experiments over a wide range of downstream tasks (e.g., cross-modal retrieval, video question answering, and video captioning), we demonstrate the superiority of CoCo-BERT as a pre-trained structure.
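The abstract's core idea combines contrastive inter-modal matching (video vs. text) with intra-modal denoising (masked vs. unmasked views of the same video). A minimal numpy sketch of an InfoNCE-style objective of that general shape is below; all names, dimensions, and the toy features are illustrative assumptions, and the paper's actual encoders, temperature, and negative sampling differ.

```python
import numpy as np

def info_nce(queries, keys, temperature=0.07):
    """InfoNCE-style contrastive loss: each query should match the key
    at the same batch index; all other keys serve as negatives."""
    # L2-normalize so the dot product is a cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                  # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    idx = np.arange(len(q))
    # Positives lie on the diagonal (matched pairs).
    return -np.mean(np.log(probs[idx, idx]))

# Toy features (hypothetical): 4 matched video/text embedding pairs,
# plus a "masked" (noisy) view of each video.
rng = np.random.default_rng(0)
video = rng.normal(size=(4, 8))
text = video + 0.05 * rng.normal(size=(4, 8))           # aligned captions
masked_video = video + 0.05 * rng.normal(size=(4, 8))   # noisy augmentation

# Inter-modal matching plus intra-modal denoising, combined additively
# as a rough stand-in for the paper's joint objective.
loss = info_nce(video, text) + info_nce(masked_video, video)
print(round(float(loss), 4))
```

Treating the masked sequence as an augmented view means the denoising term is just another contrastive pairing, so both terms share one loss form.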
Pages: 5600-5608 (9 pages)