CLUE: Contrastive language-guided learning for referring video object segmentation

Cited by: 2
Authors
Gao, Qiqi [1 ]
Zhong, Wanjun [2 ]
Li, Jie [1 ]
Zhao, Tiejun [1 ]
Affiliations
[1] Harbin Inst Technol, 92 Xida St, Harbin 150001, Heilongjiang, Peoples R China
[2] Sun Yat Sen Univ, 135 Xingangxi Rd, Guangzhou 510275, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video object segmentation; Multi-modal; Contrastive learning; Deep learning;
DOI
10.1016/j.patrec.2023.12.017
CLC classification
TP18 [Artificial Intelligence Theory];
Subject classification
081104; 0812; 0835; 1405;
Abstract
Referring video object segmentation (R-VOS), the task of separating the object described by a natural language query from the video frames, has become increasingly important with recent advances in multi-modal understanding. Existing approaches are mainly visual-dominant in both the representation-learning and decision-making processes, and are less sensitive to fine-grained clues in the text description. To address this, we propose a language-guided contrastive learning and data augmentation framework that enhances the model's sensitivity to the fine-grained textual clues (i.e., color, location, subject) that relate heavily to the video information. By substituting key information in the original sentences and paraphrasing them with a text-based generation model, our approach conducts contrastive learning on automatically built, diverse, and fluent contrastive samples. We further enhance the multi-modal alignment with a sparse attention mechanism, which finds the most relevant video information via optimal transport. Experiments on a large-scale R-VOS benchmark show that our method significantly improves strong Transformer-based baselines, and further analysis demonstrates our model's stronger ability to identify textual semantics.
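The abstract's augmentation idea can be illustrated with a minimal sketch: swap one fine-grained clue word (color or location) to build a negative query, then score pairs with a standard InfoNCE-style contrastive loss. The lexicons, function names, and similarity inputs below are illustrative assumptions, not the paper's actual implementation (which also paraphrases the substituted sentences with a generation model).

```python
import math
import random

# Hypothetical attribute lexicons; the paper's actual substitution sets are not given.
COLORS = {"red", "blue", "black", "white"}
LOCATIONS = {"left", "right", "front", "behind"}

def make_negative(query: str) -> str:
    """Build a contrastive (negative) query by swapping the first
    fine-grained clue word (color or location) for a different one."""
    tokens = query.split()
    for i, tok in enumerate(tokens):
        for lexicon in (COLORS, LOCATIONS):
            if tok in lexicon:
                tokens[i] = random.choice(sorted(lexicon - {tok}))
                return " ".join(tokens)
    return query  # no substitutable clue found

def info_nce(sim_pos: float, sims_neg: list, tau: float = 0.07) -> float:
    """InfoNCE loss over cosine similarities: pull the true text-video
    pair together, push the substituted negatives apart."""
    num = math.exp(sim_pos / tau)
    den = num + sum(math.exp(s / tau) for s in sims_neg)
    return -math.log(num / den)
```

A well-aligned positive pair with dissimilar negatives yields a near-zero loss, while a mismatched pair is penalized heavily, which is what drives the model to attend to the substituted clue words.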
Pages: 115-121 (7 pages)