SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation

Cited by: 0

Authors:

Luo, Zhuoyan [1]

Xiao, Yicheng [1]

Liu, Yong [1]

Li, Shuyan [3]

Wang, Yitong [2]

Tang, Yansong [1]

Li, Xiu [1]

Yang, Yujiu [1]

Affiliations:

[1] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen, Peoples R China

[2] ByteDance Inc, Beijing, Peoples R China

[3] Univ Cambridge, Engn Dept, Cambridge, England

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

None listed

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction as well as segmentation for each frame separately. However, the lack of a global view of video content leads to difficulties in effectively utilizing inter-frame relationships and understanding textual descriptions of object temporal variations. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps. Moreover, we present multi-modal contrastive supervision to help construct well-aligned joint space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin. Besides, the emphasis on temporal coherence enhances the segmentation stability and adaptability of our method in processing text expressions with temporal variations. Code is available at https://github.com/RobertLuo1/NeurIPS2023_SOC.
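The video-level contrastive supervision mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes a symmetric InfoNCE-style loss, mean pooling over frames as a stand-in for the paper's object clustering, and a temperature of 0.07; the function names and tensor shapes are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def video_text_contrastive_loss(frame_obj_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss between video-level and sentence embeddings.

    frame_obj_emb: (B, T, D) frame-level embeddings of the referred object.
    text_emb:      (B, D) sentence embeddings, row i paired with video i.
    """
    # Assumption: mean pooling over T frames stands in for the paper's
    # aggregation of frame-level object embeddings into a video-level one.
    video_emb = l2_normalize(frame_obj_emb.mean(axis=1))   # (B, D)
    text_emb = l2_normalize(text_emb)                       # (B, D)

    # Cosine similarity of every video against every sentence in the batch.
    logits = video_emb @ text_emb.T / temperature           # (B, B)
    idx = np.arange(logits.shape[0])

    # Matched pairs sit on the diagonal; all other entries act as negatives.
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)
```

Under these assumptions, the loss is small when each video-level embedding lies closest to its own sentence embedding and grows when pairs are mismatched, which is the sense in which such supervision encourages a well-aligned joint space at the video level.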

Pages: 13
