Mask prior generation with language queries guided networks for referring image segmentation

Cited by: 0
Authors
Zhou, Jinhao [1]
Xiao, Guoqiang [1]
Lew, Michael S. [3]
Wu, Song [1,2]
Affiliations
[1] Southwest Univ, Coll Comp & Informat Sci, Chongqing 400715, Peoples R China
[2] Southwest Univ, Yibin Acad, Yibin 644000, Sichuan, Peoples R China
[3] Leiden Univ, LIACS Media Lab, Leiden, Netherlands
Keywords
Referring image segmentation; Bidirectional spatial alignment; Channel attention fusion gate; Mask prior generator
DOI
10.1016/j.cviu.2025.104296
Chinese Library Classification (CLC) number
TP18 [Theory of artificial intelligence]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Referring Image Segmentation (RIS) aims to generate a pixel-level mask that accurately segments the target object described by a natural language expression. Previous RIS methods do not fully exploit the significant language information in the encoder and decoder stages, and obtain the prediction mask with a simple upsampling-convolution operation, which leads to inaccurate localization of the target object. This paper therefore proposes a Mask Prior Generation with Language Queries Guided Network (MPG-LQGNet). In the encoder of MPG-LQGNet, a Bidirectional Spatial Alignment Module (BSAM) performs bidirectional fusion of the vision and language embeddings and generates additional language queries that capture both the location of the target and the semantics of the expression. In addition, a Channel Attention Fusion Gate (CAFG) is designed to better exploit the significant content of the cross-modal embeddings. In the decoder of MPG-LQGNet, the Language Query Guided Mask Prior Generator (LQPG) uses the generated language queries to activate significant information in the upsampled decoding features, yielding a more accurate mask prior that guides the final prediction. Extensive experiments on the RefCOCO series of datasets show that our method consistently improves over state-of-the-art methods. The source code of MPG-LQGNet is available at https://github.com/SWU-CS-MediaLab/MPG-LQGNet.
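The abstract does not spell out how the CAFG is computed, so the following is only a generic illustration of channel-gated fusion, not the authors' implementation. It assumes two feature maps with the same shape, laid out as `[channels][positions]`, and a hypothetical helper name `channel_gate_fuse`. The learned projections that a real gate would use are replaced by a plain sum of the pooled values.

```python
import math


def channel_gate_fuse(vision, language):
    """Blend two [channels][positions] feature maps with one gate per channel.

    Illustrative assumption only: the gate is a sigmoid of the summed
    channel means, standing in for a learned channel-attention layer.
    """
    if len(vision) != len(language):
        raise ValueError("both inputs need the same number of channels")
    fused = []
    for v_chan, l_chan in zip(vision, language):
        # Global average pooling: one summary value per channel and modality.
        v_mean = sum(v_chan) / len(v_chan)
        l_mean = sum(l_chan) / len(l_chan)
        # Per-channel gate in (0, 1).
        gate = 1.0 / (1.0 + math.exp(-(v_mean + l_mean)))
        # Convex combination of the two modalities at every position.
        fused.append([gate * v + (1.0 - gate) * l
                      for v, l in zip(v_chan, l_chan)])
    return fused


if __name__ == "__main__":
    vision_feat = [[0.2, 0.4], [1.0, 3.0]]
    language_feat = [[0.6, 0.0], [1.0, 3.0]]
    print(channel_gate_fuse(vision_feat, language_feat))
```

Because the output is a convex combination, each fused value stays between the corresponding vision and language values, and channels where both inputs agree pass through unchanged.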
Pages: 13