CLUE: Contrastive language-guided learning for referring video object segmentation

Cited by: 2
Authors
Gao, Qiqi [1 ]
Zhong, Wanjun [2 ]
Li, Jie [1 ]
Zhao, Tiejun [1 ]
Affiliations
[1] Harbin Inst Technol, 92 Xida St, Harbin 150001, Heilongjiang, Peoples R China
[2] Sun Yat Sen Univ, 135 Xingangxi Rd, Guangzhou 510275, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video object segmentation; Multi-modal; Contrastive learning; Deep learning;
DOI
10.1016/j.patrec.2023.12.017
CLC classification
TP18 [Artificial Intelligence Theory];
Subject classification
081104; 0812; 0835; 1405;
Abstract
Referring video object segmentation (R-VOS), the task of separating the object described by a natural language query from the video frames, has become increasingly important with recent advances in multi-modal understanding. Existing approaches are mainly visual-dominant in both the representation-learning and decision-making processes, and are less sensitive to fine-grained clues in the text description. To address this, we propose a language-guided contrastive learning and data augmentation framework that enhances the model's sensitivity to the fine-grained textual clues (i.e., color, location, subject) that relate heavily to the video information. By substituting key information in the original sentences and paraphrasing them with a text-based generation model, our approach conducts contrastive learning on automatically built, diverse, and fluent contrastive samples. We further enhance the multi-modal alignment with a sparse attention mechanism, which finds the most relevant video information via optimal transport. Experiments on a large-scale R-VOS benchmark show that our method significantly improves strong Transformer-based baselines, and further analysis demonstrates our model's stronger ability to identify textual semantics.
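The abstract's augmentation idea can be illustrated with a minimal sketch: swap one fine-grained clue word (color or location) to build a negative query, then score pairs with a standard InfoNCE-style contrastive loss. The lexicons, function names, and similarity inputs below are illustrative assumptions, not the paper's actual implementation (which also paraphrases the substituted sentences with a generation model).

```python
import math
import random

# Hypothetical attribute lexicons; the paper's actual substitution sets are not given.
COLORS = {"red", "blue", "black", "white"}
LOCATIONS = {"left", "right", "front", "behind"}

def make_negative(query: str) -> str:
    """Build a contrastive (negative) query by swapping the first
    fine-grained clue word (color or location) for a different one."""
    tokens = query.split()
    for i, tok in enumerate(tokens):
        for lexicon in (COLORS, LOCATIONS):
            if tok in lexicon:
                tokens[i] = random.choice(sorted(lexicon - {tok}))
                return " ".join(tokens)
    return query  # no substitutable clue found

def info_nce(sim_pos: float, sims_neg: list, tau: float = 0.07) -> float:
    """InfoNCE loss over cosine similarities: pull the true text-video
    pair together, push the substituted negatives apart."""
    num = math.exp(sim_pos / tau)
    den = num + sum(math.exp(s / tau) for s in sims_neg)
    return -math.log(num / den)
```

A well-aligned positive pair with dissimilar negatives yields a near-zero loss, while a mismatched pair is penalized heavily, which is what drives the model to attend to the substituted clue words.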
Pages: 115-121 (7 pages)