Object-Agnostic Transformers for Video Referring Segmentation

Cited by: 14
Authors
Yang, Xu [1 ]
Wang, Hao [1 ]
Xie, De [1 ]
Deng, Cheng [1 ]
Tao, Dacheng [2 ]
Affiliations
[1] Xidian Univ, Sch Elect Engn, Xian 710071, Peoples R China
[2] JD Explore Acad, Beijing 101111, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Task analysis; Visualization; Transformers; Feature extraction; Object detection; Image segmentation; Context modeling; Video referring segmentation; multi-modal learning; video grounding; NETWORK;
DOI
10.1109/TIP.2022.3161832
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video referring segmentation focuses on segmenting out the object in a video based on the corresponding textual description. Previous works have primarily tackled this task by devising two crucial parts, an intra-modal module for context modeling and an inter-modal module for heterogeneous alignment. However, there are two essential drawbacks of this approach: (1) it lacks joint learning of context modeling and heterogeneous alignment, leading to insufficient interactions among input elements; (2) both modules require task-specific expert knowledge to design, which severely limits the flexibility and generality of prior methods. To address these problems, we here propose a novel Object-Agnostic Transformer-based Network, called OATNet, that simultaneously conducts intra-modal and inter-modal learning for video referring segmentation, without the aid of object detection or category-specific pixel labeling. More specifically, we first directly feed the sequence of textual tokens and visual tokens (pixels rather than detected object bounding boxes) into a multi-modal encoder, where context and alignment are simultaneously and effectively explored. We then design a novel cascade segmentation network to decouple our task into coarse-grained segmentation and fine-grained refinement. Moreover, considering the difficulty of samples, a more balanced metric is provided to better diagnose the performance of the proposed method. Extensive experiments on two popular datasets, A2D Sentences and J-HMDB Sentences, demonstrate that our proposed approach noticeably outperforms state-of-the-art methods.
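The abstract outlines the paper's core idea of joint intra- and inter-modal learning: textual tokens and raw visual (pixel-level) tokens are concatenated into a single sequence and processed by one multi-modal transformer encoder, after which a cascade performs coarse segmentation followed by fine-grained refinement. The following is a minimal, hypothetical PyTorch sketch of that joint-encoding step; it is not the authors' OATNet code, and all module names, dimensions, and the simple linear mask head are illustrative assumptions (positional encodings and the refinement stage are omitted for brevity).

```python
# Illustrative sketch only: text and pixel tokens share one transformer encoder,
# so context modeling and cross-modal alignment happen in a single attention stack.
import torch
import torch.nn as nn


class ObjectAgnosticEncoder(nn.Module):
    """Joint intra-/inter-modal encoding over word tokens and pixel tokens (hypothetical)."""

    def __init__(self, d_model=256, n_heads=8, n_layers=4, vocab_size=10000, feat_dim=2048):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)             # word-token embeddings
        self.visual_proj = nn.Conv2d(feat_dim, d_model, kernel_size=1)  # project backbone frame features
        self.modality_embed = nn.Embedding(2, d_model)                  # marks text vs. visual tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mask_head = nn.Linear(d_model, 1)                          # coarse per-pixel logit

    def forward(self, word_ids, visual_feat):
        # word_ids: (B, L) token ids; visual_feat: (B, feat_dim, H, W) per-frame features
        B, _, H, W = visual_feat.shape
        text_tok = self.text_embed(word_ids)                                 # (B, L, d)
        vis_tok = self.visual_proj(visual_feat).flatten(2).transpose(1, 2)   # (B, H*W, d)
        text_tok = text_tok + self.modality_embed.weight[0]
        vis_tok = vis_tok + self.modality_embed.weight[1]
        tokens = torch.cat([text_tok, vis_tok], dim=1)        # one shared sequence
        fused = self.encoder(tokens)                          # self-attention mixes context and alignment
        vis_out = fused[:, word_ids.size(1):]                 # keep only the visual tokens
        coarse = self.mask_head(vis_out).reshape(B, 1, H, W)  # coarse segmentation logits
        return coarse                                         # a refinement decoder would sharpen this map


# Quick shape check with dummy inputs
model = ObjectAgnosticEncoder()
words = torch.randint(0, 10000, (2, 12))      # 12-word referring expression
feats = torch.randn(2, 2048, 16, 16)          # 16x16 feature grid from a video backbone
print(model(words, feats).shape)              # torch.Size([2, 1, 16, 16])
```

The point of the single shared encoder is that every word can attend to every pixel (and vice versa) in the same attention stack, which is what removes the need for separately hand-designed context-modeling and alignment modules described in the abstract.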
Pages: 2839-2849
Page count: 11