Dynamic Contrastive Learning with Pseudo-samples Intervention for Weakly Supervised Joint Video MR and HD

Cited by: 3
Authors
Kong, Shuhan [1 ]
Li, Liang [2 ]
Zhang, Beichen [2 ]
Wang, Wenyu
Jiang, Bin [1 ]
Yan, Chenggang [3 ]
Xu, Changhao [1 ]
Affiliations
[1] Shandong Univ, Jinan, Shandong, Peoples R China
[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[3] Hangzhou Dianzi Univ, Lishui Inst, Hangzhou, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Keywords
vision confusion; contrastive learning; reverse inference; pseudo-samples intervention; NETWORK;
DOI
10.1145/3581783.3612384
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Joint video moment retrieval (MR) and highlight detection (HD) aims to find the video moments relevant to a query text. Existing methods are fully supervised and rely on manual annotation, and their coarse multi-modal interactions easily lose details of both video and text. In addition, some works introduce weakly supervised learning with random masks, but masking a single word forces the model to focus on that masked word and ignore multi-modal contextual information. In view of this, we attempt the weakly supervised joint task (MR+HD) and propose Dynamic Contrastive Learning with Pseudo-Sample Intervention (CPI) for better multi-modal video comprehension. First, we construct pseudo-samples in place of single random masks for more effective contrastive learning, using a proportional sampling strategy that keeps each pseudo-sample semantically different from the query text. This reduces the over-reliance of a single random mask on global text semantics and lets the model learn multi-modal context from every word fairly. Second, we design a dynamic intervention contrastive loss that strengthens the model's core feature-matching ability: pseudo-sample intervention is added when negative proposals score close to positive proposals. This helps the model overcome the vision-confusion phenomenon and match semantic similarity rather than word similarity. Extensive experiments demonstrate the effectiveness of CPI and the potential of the weakly supervised joint task.
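To make the two ingredients of the abstract concrete, the PyTorch-style sketch below illustrates how a proportionally masked pseudo-sample and a dynamically gated intervention term could enter a margin-based contrastive loss. This is a minimal illustration under assumptions, not the authors' implementation: the function names (proportional_mask, dynamic_intervention_loss), the parameters mask_ratio, margin, and tau, and the hinge form of the loss are all hypothetical.

```python
# Illustrative sketch only (not from the paper): names, shapes, and the
# hinge-style loss form are assumptions inferred from the abstract.
import torch
import torch.nn.functional as F

def proportional_mask(query_tokens: torch.Tensor, mask_ratio: float = 0.3,
                      mask_id: int = 103) -> torch.Tensor:
    """Build a pseudo-sample by masking a fixed proportion of query tokens,
    keeping the pseudo-sample semantically different from the original query."""
    num_tokens = query_tokens.size(0)
    num_masked = max(1, int(mask_ratio * num_tokens))
    idx = torch.randperm(num_tokens)[:num_masked]
    pseudo = query_tokens.clone()
    pseudo[idx] = mask_id  # assumed [MASK] id of a BERT-style tokenizer
    return pseudo

def dynamic_intervention_loss(pos_score: torch.Tensor, neg_score: torch.Tensor,
                              pseudo_score: torch.Tensor,
                              margin: float = 0.2, tau: float = 0.1) -> torch.Tensor:
    """Margin-based contrastive loss; the pseudo-sample term is activated only
    when a negative proposal scores within tau of the positive proposal
    (the 'vision confusion' case described in the abstract)."""
    base = F.relu(margin + neg_score - pos_score)
    confused = (pos_score - neg_score) < tau                  # dynamic trigger
    intervention = F.relu(margin + pseudo_score - pos_score)  # push pseudo below positive
    return (base + confused.float() * intervention).mean()

# Toy usage with batch-level matching scores (e.g., cosine similarities):
if __name__ == "__main__":
    pos = torch.tensor([0.80, 0.65])
    neg = torch.tensor([0.75, 0.30])   # first pair is "confused" (gap < tau)
    pseudo = torch.tensor([0.70, 0.40])
    print(dynamic_intervention_loss(pos, neg, pseudo))
```

The point mirrored from the abstract is that the intervention term is gated rather than always on, so the extra pseudo-sample supervision only kicks in when positive and negative proposals are hard to separate.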
Pages: 538-546
Number of pages: 9