Actor and Action Modular Network for Text-Based Video Segmentation

Cited by: 5
Authors
Yang, Jianhua [1 ]
Huang, Yan [2 ,3 ]
Niu, Kai [4 ]
Huang, Linjiang [5 ]
Ma, Zhanyu [1 ]
Wang, Liang [2 ,3 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Pattern Recognit & Intelligent Syst Lab, Beijing 100876, Peoples R China
[2] Chinese Acad Sci CASIA, Ctr Res Intelligent Percept & Comp CRIPAC, Inst Automat, Natl Lab Pattern Recognit NLPR, Beijing 100190, Peoples R China
[3] Univ Chinese Acad Sci UCAS, Sch Artificial Intelligence, Beijing 100049, Peoples R China
[4] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Peoples R China
[5] Chinese Univ Hong Kong, Multimedia Lab, Hong Kong 999077, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation;
Keywords
Semantics; Task analysis; Electron tubes; Proposals; Predictive models; Image color analysis; Electronic mail; Video object segmentation; language attention mechanism; modular network; multi-modal learning;
DOI
10.1109/TIP.2022.3185487
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text-based video segmentation aims to segment an actor in a video sequence given a textual query that specifies the actor and the action it performs. Previous methods fail to explicitly align the video content with the textual query in a fine-grained manner according to the actor and its action, due to the problem of semantic asymmetry: the two modalities contain different amounts of semantic information during multi-modal fusion. To alleviate this problem, we propose a novel actor and action modular network that localizes the actor and its action in two separate modules. Specifically, we first learn the actor- and action-related content from the video and the textual query, and then match them in a symmetrical manner to localize the target tube. The target tube, which contains the desired actor and action, is then fed into a fully convolutional network to predict segmentation masks of the actor. Our method also associates objects across multiple frames with the proposed temporal proposal aggregation mechanism, which enables it to segment the video effectively and keep predictions temporally consistent. The whole model supports joint learning of actor-action matching and segmentation, and achieves state-of-the-art performance for both single-frame and full-video segmentation on the A2D Sentences and J-HMDB Sentences datasets.
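The sketch below illustrates the core idea described in the abstract: two separate modules score candidate tubes against actor-related and action-related cues from the query, and the combined score localizes the target tube. It is a minimal PyTorch sketch under stated assumptions; the module names, feature dimensions, and the simple dot-product matching and additive fusion are illustrative choices of ours, not the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MatchingModule(nn.Module):
    """Scores candidate tubes against one textual cue (actor or action)."""

    def __init__(self, visual_dim, text_dim, joint_dim=256):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, tube_feats, text_feat):
        # tube_feats: (num_tubes, visual_dim); text_feat: (text_dim,)
        v = F.normalize(self.visual_proj(tube_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        return v @ t  # (num_tubes,) cosine-style matching scores


class ActorActionMatcher(nn.Module):
    """Hypothetical fusion of actor and action scores to pick the target tube."""

    def __init__(self, app_dim=2048, mot_dim=1024, text_dim=300):
        super().__init__()
        self.actor_module = MatchingModule(app_dim, text_dim)   # appearance features vs. actor words
        self.action_module = MatchingModule(mot_dim, text_dim)  # motion features vs. action words

    def forward(self, appearance_feats, motion_feats, actor_text, action_text):
        actor_scores = self.actor_module(appearance_feats, actor_text)
        action_scores = self.action_module(motion_feats, action_text)
        scores = actor_scores + action_scores  # symmetric combination of the two cues
        return scores.argmax(dim=0), scores    # index of the selected tube, all scores


# Toy usage: 5 candidate tubes with randomly initialized features.
matcher = ActorActionMatcher()
idx, scores = matcher(torch.randn(5, 2048), torch.randn(5, 1024),
                      torch.randn(300), torch.randn(300))
print(idx.item(), scores.shape)

In the paper, the selected tube is then passed to a fully convolutional segmentation head, and the temporal proposal aggregation mechanism links proposals across frames; neither step is shown here.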
Pages: 4474-4489
Number of pages: 16