Object-Agnostic Transformers for Video Referring Segmentation

Cited by: 14
Authors
Yang, Xu [1 ]
Wang, Hao [1 ]
Xie, De [1 ]
Deng, Cheng [1 ]
Tao, Dacheng [2 ]
Affiliations
[1] Xidian Univ, Sch Elect Engn, Xian 710071, Peoples R China
[2] JD Explore Acad, Beijing 101111, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Task analysis; Visualization; Transformers; Feature extraction; Object detection; Image segmentation; Context modeling; Video referring segmentation; multi-modal learning; video grounding; NETWORK;
DOI
10.1109/TIP.2022.3161832
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video referring segmentation focuses on segmenting out the object in a video based on the corresponding textual description. Previous works have primarily tackled this task by devising two crucial parts, an intra-modal module for context modeling and an inter-modal module for heterogeneous alignment. However, there are two essential drawbacks of this approach: (1) it lacks joint learning of context modeling and heterogeneous alignment, leading to insufficient interactions among input elements; (2) both modules require task-specific expert knowledge to design, which severely limits the flexibility and generality of prior methods. To address these problems, we here propose a novel Object-Agnostic Transformer-based Network, called OATNet, that simultaneously conducts intra-modal and inter-modal learning for video referring segmentation, without the aid of object detection or category-specific pixel labeling. More specifically, we first directly feed the sequence of textual tokens and visual tokens (pixels rather than detected object bounding boxes) into a multi-modal encoder, where context and alignment are simultaneously and effectively explored. We then design a novel cascade segmentation network to decouple our task into coarse-grained segmentation and fine-grained refinement. Moreover, considering the difficulty of samples, a more balanced metric is provided to better diagnose the performance of the proposed method. Extensive experiments on two popular datasets, A2D Sentences and J-HMDB Sentences, demonstrate that our proposed approach noticeably outperforms state-of-the-art methods.
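The abstract outlines the paper's core idea of joint intra- and inter-modal learning: textual tokens and raw visual (pixel-level) tokens are concatenated into a single sequence and processed by one multi-modal transformer encoder, after which a cascade performs coarse segmentation followed by fine-grained refinement. The following is a minimal, hypothetical PyTorch sketch of that joint-encoding step; it is not the authors' OATNet code, and all module names, dimensions, and the simple linear mask head are illustrative assumptions (positional encodings and the refinement stage are omitted for brevity).

```python
# Illustrative sketch only: text and pixel tokens share one transformer encoder,
# so context modeling and cross-modal alignment happen in a single attention stack.
import torch
import torch.nn as nn


class ObjectAgnosticEncoder(nn.Module):
    """Joint intra-/inter-modal encoding over word tokens and pixel tokens (hypothetical)."""

    def __init__(self, d_model=256, n_heads=8, n_layers=4, vocab_size=10000, feat_dim=2048):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)             # word-token embeddings
        self.visual_proj = nn.Conv2d(feat_dim, d_model, kernel_size=1)  # project backbone frame features
        self.modality_embed = nn.Embedding(2, d_model)                  # marks text vs. visual tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mask_head = nn.Linear(d_model, 1)                          # coarse per-pixel logit

    def forward(self, word_ids, visual_feat):
        # word_ids: (B, L) token ids; visual_feat: (B, feat_dim, H, W) per-frame features
        B, _, H, W = visual_feat.shape
        text_tok = self.text_embed(word_ids)                                 # (B, L, d)
        vis_tok = self.visual_proj(visual_feat).flatten(2).transpose(1, 2)   # (B, H*W, d)
        text_tok = text_tok + self.modality_embed.weight[0]
        vis_tok = vis_tok + self.modality_embed.weight[1]
        tokens = torch.cat([text_tok, vis_tok], dim=1)        # one shared sequence
        fused = self.encoder(tokens)                          # self-attention mixes context and alignment
        vis_out = fused[:, word_ids.size(1):]                 # keep only the visual tokens
        coarse = self.mask_head(vis_out).reshape(B, 1, H, W)  # coarse segmentation logits
        return coarse                                         # a refinement decoder would sharpen this map


# Quick shape check with dummy inputs
model = ObjectAgnosticEncoder()
words = torch.randint(0, 10000, (2, 12))      # 12-word referring expression
feats = torch.randn(2, 2048, 16, 16)          # 16x16 feature grid from a video backbone
print(model(words, feats).shape)              # torch.Size([2, 1, 16, 16])
```

The point of the single shared encoder is that every word can attend to every pixel (and vice versa) in the same attention stack, which is what removes the need for separately hand-designed context-modeling and alignment modules described in the abstract.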
Pages: 2839-2849
Page count: 11