Vision-Aware Language Reasoning for Referring Image Segmentation

Cited: 0
Authors
Xu, Fayou [1 ]
Luo, Bing [1 ]
Zhang, Chao [2 ]
Xu, Li [3 ]
Pu, Mingxing [1 ]
Li, Bo [1 ]
Affiliations
[1] Xihua Univ, Sch Comp & Software Engn, Chengdu 610039, Peoples R China
[2] Sichuan Police Coll, Key Lab Intelligent Policing, Luzhou 646000, Peoples R China
[3] Xihua Univ, Sch Sci, Chengdu 610039, Peoples R China
Keywords
Referring image segmentation; Vision and language; Explainable language-structure reasoning
DOI
10.1007/s11063-023-11377-z
CLC Number (Chinese Library Classification)
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Referring image segmentation is a multimodal joint task that aims to segment the object indicated by a natural-language expression from its paired image. However, the diversity of language annotations tends to introduce semantic ambiguity, making the semantic representation produced by the language encoder imprecise. Existing methods do not correct the language encoding module, so semantic errors in the language features cannot be repaired in subsequent processing, resulting in semantic deviation. To this end, we propose a vision-aware language reasoning model. Intuitively, the segmentation result can guide the reconstruction of language features, which can be expressed as a tree-structured recursive process. Specifically, we design a language reasoning encoding module and a mask loopback optimization module to optimize the language encoding tree, learning the feature weights of tree nodes through backpropagation. To overcome the problem that attention between local language words and visual regions easily introduces noise in traditional attention modules, we use global language prior information to compute the importance of each word and use these weights to re-weight the visual region features, realized as a language-aware vision attention module. Experimental results on four benchmark datasets show that the proposed method achieves performance improvements.
Pages: 11313-11331
Page count: 19
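
The language-aware vision attention described in the abstract can be sketched roughly as follows. This is a minimal, hypothetical PyTorch sketch, not the paper's actual formulation: the module name, the feature dimensions, mean pooling as the global language prior, scaled dot-product word scoring, and the sigmoid region gate are all our assumptions.

# Hypothetical sketch of a language-aware vision attention module in the
# spirit of the abstract: a global language prior scores word importance,
# and the re-weighted language query then weights visual region features.
# All names, dimensions, and the exact weighting scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageAwareVisionAttention(nn.Module):
    def __init__(self, lang_dim: int, vis_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.word_proj = nn.Linear(lang_dim, hidden_dim)
        self.global_proj = nn.Linear(lang_dim, hidden_dim)
        self.vis_proj = nn.Conv2d(vis_dim, hidden_dim, kernel_size=1)

    def forward(self, word_feats, vis_feats):
        # word_feats: (B, T, lang_dim) per-word language features
        # vis_feats:  (B, vis_dim, H, W) visual feature map
        # Global language prior: mean-pooled sentence representation (assumed).
        global_feat = self.global_proj(word_feats.mean(dim=1))          # (B, hidden)
        words = self.word_proj(word_feats)                              # (B, T, hidden)
        # Word importance from the global prior (scaled dot product).
        scores = torch.einsum('bth,bh->bt', words, global_feat)
        scores = scores / words.size(-1) ** 0.5
        word_weights = F.softmax(scores, dim=1)                         # (B, T)
        # Language query: word features re-weighted by their importance.
        lang_query = torch.einsum('bt,bth->bh', word_weights, words)    # (B, hidden)
        # Gate visual regions by their similarity to the language query.
        vis = self.vis_proj(vis_feats)                                  # (B, hidden, H, W)
        region_scores = torch.einsum('bchw,bc->bhw', vis, lang_query)
        region_attn = torch.sigmoid(region_scores).unsqueeze(1)         # (B, 1, H, W)
        return vis * region_attn                                        # language-weighted visual features

if __name__ == "__main__":
    # Toy usage with made-up shapes: 12 word features, a 26x26 feature map.
    attn = LanguageAwareVisionAttention(lang_dim=768, vis_dim=512)
    words = torch.randn(2, 12, 768)
    vis = torch.randn(2, 512, 26, 26)
    print(attn(words, vis).shape)  # torch.Size([2, 256, 26, 26])

The design intuition, as the abstract states it, is that scoring words against a global sentence prior suppresses spurious word-region attention that arises when every local word attends to every region independently.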