A closer look at referring expressions for video object segmentation

Cited by: 11
Authors
Bellver, Miriam [1]
Ventura, Carles [2]
Silberer, Carina [3]
Kazakos, Ioannis [4]
Torres, Jordi [1]
Giro-i-Nieto, Xavier [5, 6]
Affiliations
[1] Barcelona Supercomp Ctr BSC, Barcelona, Spain
[2] Univ Oberta Catalunya UOC, Barcelona, Spain
[3] Univ Stuttgart, Inst NLP, Stuttgart, Germany
[4] Natl Tech Univ Athens, Athens, Greece
[5] Univ Politecn Catalunya UPC, Barcelona, Catalonia, Spain
[6] CSIC UPC, Inst Robot & Informat Ind, Barcelona, Catalonia, Spain
Keywords
Referring expressions; Video object segmentation; Vision and language
DOI
10.1007/s11042-022-13413-x
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The task of Language-guided Video Object Segmentation (LVOS) aims at generating binary masks for an object referred to by a linguistic expression. When this expression unambiguously describes an object in the scene, it is called a referring expression (RE). Our work argues that existing benchmarks used for LVOS are mainly composed of trivial cases, in which referents can be identified with simple phrases. Our analysis relies on a new categorization of the referring expressions in the DAVIS-2017 and Actor-Action datasets into trivial and non-trivial REs, where the non-trivial REs are further annotated with seven RE semantic categories. We leverage these data to analyze the performance of RefVOS, a novel neural network that obtains competitive results for the task of language-guided image segmentation and state-of-the-art results for LVOS. Our study indicates that the major challenges for the task are related to understanding motion and static actions.
Pages: 4419-4438
Page count: 20