MRRVOS: Modular Refinement Referring Video Object Segmentation

Cited by: 0

Authors:

Duan, Zhijiang [1]

Sun, Yukuan [2]

Wang, Jianming [1]

Affiliations:

[1] TianGong Univ, Sch Comp Sci & Technol, Tianjin 300387, Peoples R China

[2] TianGong Univ, Ctr Engn Internship & Training, Tianjin 300387, Peoples R China

Source:

WEB AND BIG DATA | 2021 / Vol. 1505

Keywords:

Referring video object segmentation; Semantic similarity; Image caption

DOI:

10.1007/978-981-16-8143-1_11

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Previous referring video object segmentation methods focus only on the final prediction result. When the user's linguistic query does not match the content of the video, they still produce a segmentation, which is erroneous. We propose a new referring video object segmentation framework, Modular Refinement Referring Video Object Segmentation (MRRVOS), which stops segmentation and reports an error when the objects in the referring linguistic query do not match the video frame. Given a video clip and a linguistic query, our method first segments the specified object in a video frame automatically. It then matches the linguistic query against the video frames and segments the object in the other frames through a recursive three-module model. (1) Referring Video Object Segmentation module: we treat the referring video object segmentation task as a joint problem of referring object segmentation in the image and mask propagation in the video. (2) Image caption module: a deep convolutional neural network (CNN) encodes every frame except the first, and a Long Short-Term Memory (LSTM) recurrent neural network (RNN) decoder generates the output caption, which is added to the corpus. (3) Semantic dissimilarity module: all generated captions are placed in the corpus and embedded in a vector space, and the input linguistic query is used to perform a semantic dissimilarity search over them. We show that our approach is competitive with state-of-the-art methods.
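The query-versus-caption check described in module (3) can be illustrated with a minimal Python sketch. The abstract does not specify the embedding or the decision rule, so everything here is an assumption made for illustration: a bag-of-words vector stands in for the embedding, cosine similarity stands in for the (dis)similarity measure, and the function names and the 0.5 threshold are hypothetical rather than taken from the paper.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens (assumed stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_matches_caption(query, caption, threshold=0.5):
    """Hypothetical decision rule: the query matches if similarity reaches the threshold."""
    return cosine_similarity(bag_of_words(query), bag_of_words(caption)) >= threshold


def first_mismatch(query, captions, threshold=0.5):
    """Return the index of the first caption that does not match the query, or None.

    In a pipeline like the one described, a non-None result would be the point
    at which segmentation stops and an error is reported.
    """
    for index, caption in enumerate(captions):
        if not query_matches_caption(query, caption, threshold):
            return index
    return None


if __name__ == "__main__":
    query = "a man riding a horse"
    captions = [
        "a man riding a horse on a beach",  # shares most tokens with the query
        "a dog running on grass",           # shares only the article "a"
    ]
    print(first_mismatch(query, captions))
```

A learned sentence embedding would likely replace the bag-of-words vector in practice; the control flow (compare each generated caption with the query, stop at the first mismatch) is the part this sketch is meant to show.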


Pages: 117-128

Page count: 12
