Clover: Towards A Unified Video-Language Alignment and Fusion Model

Cited by: 14
Authors
Huang, Jingjia [2]
Li, Yinan [1]
Feng, Jiashi [2]
Wu, Xinglong [2]
Sun, Xiaoshuai [1]
Ji, Rongrong [1]
Affiliations
[1] Xiamen Univ, Key Lab Multimedia Trusted Percept & Efficient Co, Minist Educ China, Xiamen 361005, Peoples R China
[2] ByteDance Inc, Beijing 100043, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China
Keywords
DOI
10.1109/CVPR52729.2023.01427
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Building a universal Video-Language model for solving various video understanding tasks (e.g., text-video retrieval, video question answering) is an open challenge for the machine learning field. Towards this goal, most recent works build the model by stacking uni-modal and cross-modal feature encoders and train it with pair-wise contrastive pretext tasks. Though offering attractive generality, the resulting models have to compromise between efficiency and performance. They mostly adopt different architectures to deal with different downstream tasks. We find this is because the pair-wise training cannot align and fuse features from different modalities well. We then introduce Clover, a Correlated Video-Language pre-training method, towards a universal Video-Language model for solving multiple video understanding tasks with neither performance nor efficiency compromise. It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task. Additionally, we propose to enhance the tri-modal alignment by incorporating learning from semantic masked samples and a new pair-wise ranking loss. Clover establishes new state-of-the-art results on multiple downstream tasks, including three retrieval tasks in both zero-shot and fine-tuning settings, and eight video question answering tasks. Code and pre-trained models will be released at https://github.com/LeeYN-43/Clover.
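For readers unfamiliar with the "pair-wise contrastive" training that the abstract attributes to prior work, the following is a minimal sketch of a generic symmetric InfoNCE loss between video and text embeddings. It is not Clover's tri-modal alignment task or its pair-wise ranking loss, neither of which is specified in this record. The function names, the toy embeddings, and the temperature of 0.07 are all assumptions chosen for illustration.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy in which queries[i] should match keys[i].

    Every other key in the batch acts as a negative for queries[i].
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(s - m) for s in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

def pairwise_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric video-to-text and text-to-video contrastive loss (illustrative)."""
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    return 0.5 * (info_nce(v, t, temperature) + info_nce(t, v, temperature))

# Toy batch (hypothetical data): row i of `video` is paired with row i of `text_matched`.
video = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
text_matched = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.2, 1.1]]
text_shuffled = [text_matched[1], text_matched[2], text_matched[0]]

loss_matched = pairwise_contrastive_loss(video, text_matched)
loss_shuffled = pairwise_contrastive_loss(video, text_shuffled)
print(loss_matched < loss_shuffled)
```

The loss is lower when each video sits closest to its own caption than when the pairing is shuffled. An objective of this kind only compares two modalities at a time, which is the setting the abstract contrasts with its tri-modal alignment.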
Pages: 14856-14866
Number of pages: 11