SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation

Citations: 0
Authors
Luo, Zhuoyan [1]
Xiao, Yicheng [1]
Liu, Yong [1]
Li, Shuyan [3]
Wang, Yitong [2]
Tang, Yansong [1]
Li, Xiu [1]
Yang, Yujiu [1]
Affiliations
[1] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] ByteDance Inc, Beijing, Peoples R China
[3] Univ Cambridge, Engn Dept, Cambridge, England
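Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract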
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction as well as segmentation for each frame separately. However, the lack of a global view of video content leads to difficulties in effectively utilizing inter-frame relationships and understanding textual descriptions of object temporal variations. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps. Moreover, we present multi-modal contrastive supervision to help construct well-aligned joint space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin. Besides, the emphasis on temporal coherence enhances the segmentation stability and adaptability of our method in processing text expressions with temporal variations. Code is available at https://github.com/RobertLuo1/NeurIPS2023_SOC.
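The video-level contrastive alignment mentioned in the abstract can be illustrated with a generic sketch. This is not the authors' implementation: the mean pooling over frames, the symmetric InfoNCE loss, the temperature value, and every function name below are assumptions chosen for illustration only. See the linked repository for the actual method.

```python
import math

def mean_pool(frames):
    """Average per-frame object embeddings into one video-level vector (assumed pooling)."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def l2_normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def video_text_contrastive_loss(video_frames, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch in which video i is paired with text i.

    video_frames: list of videos, each a list of per-frame embedding vectors.
    text_embs:    list of sentence-level embedding vectors, one per video.
    """
    vids = [l2_normalize(mean_pool(f)) for f in video_frames]
    txts = [l2_normalize(t) for t in text_embs]
    # Cosine similarity between every video and every text, scaled by temperature.
    logits = [[dot(v, t) / temperature for t in txts] for v in vids]

    def cross_entropy(rows):
        # Mean negative log-softmax of the diagonal (matching) entry in each row.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    video_to_text = cross_entropy(logits)
    text_to_video = cross_entropy([list(col) for col in zip(*logits)])
    return 0.5 * (video_to_text + text_to_video)

# Toy batch: two videos with two frames each, and two text embeddings.
videos = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = video_text_contrastive_loss(videos, aligned_texts)
swapped_loss = video_text_contrastive_loss(videos, swapped_texts)
```

In this toy batch the correctly paired texts give a much lower loss than the swapped ones, which is the behavior a contrastive objective of this kind is meant to reward.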
Pages: 13