Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation

Cited by: 30
Authors
Liu, Chang [1 ]
Ding, Henghui [2 ]
Zhang, Yulun [2 ]
Jiang, Xudong [1 ]
Affiliations
[1] Nanyang Technol Univ, Sch Elect & Elect Engn EEE, Singapore 639798, Singapore
[2] Swiss Fed Inst Technol, Comp Vis Lab CVL, CH-8092 Zurich, Switzerland
Keywords
Transformers; Decoding; Image segmentation; Task analysis; Feature extraction; Image reconstruction; Iterative methods; Referring image segmentation; multi-modal mutual attention; iterative multi-modal interaction; language feature reconstruction
DOI
10.1109/TIP.2023.3277791
Chinese Library Classification (CLC) Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We address the problem of referring image segmentation, which aims to generate a mask for the object specified by a natural language expression. Many recent works utilize Transformers to extract features for the target object by aggregating the attended visual regions. However, the generic attention mechanism in the Transformer uses the language input only for attention-weight calculation and does not explicitly fuse language features into its output. Thus, its output feature is dominated by vision information, which limits the model's ability to comprehensively understand the multi-modal information and introduces uncertainty into the subsequent mask decoder's extraction of the output mask. To address this issue, we propose Multi-Modal Mutual Attention (M³Att) and a Multi-Modal Mutual Decoder (M³Dec) that better fuse information from the two input modalities. Based on M³Dec, we further propose Iterative Multi-modal Interaction (IMI) to allow continuous and in-depth interactions between language and vision features. Furthermore, we introduce Language Feature Reconstruction (LFR) to prevent the language information from being lost or distorted in the extracted features. Extensive experiments show that our proposed approach significantly improves the baseline and consistently outperforms state-of-the-art referring image segmentation methods on the RefCOCO series of datasets.
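The abstract contrasts generic cross-attention, whose output is a weighted sum of visual values only, with a mutual attention that fuses both modalities into the output. The record does not give the exact M³Att formulation, so the NumPy sketch below is only an illustrative approximation of that contrast; the function names and the bidirectional-fusion scheme shown here are assumptions, not the authors' method.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(vision, language):
    # Generic Transformer cross-attention: language tokens only supply
    # the queries for weight calculation; the output is a weighted sum
    # of vision values, so language content is absent from the output.
    weights = softmax(language @ vision.T)   # (L_tokens, V_tokens)
    return weights @ vision                  # (L_tokens, D) -- vision-dominated

def mutual_attention(vision, language):
    # Hypothetical "mutual" variant: attend in both directions so that
    # each modality's features explicitly appear in an output stream.
    w_lv = softmax(language @ vision.T)      # language attends to vision
    w_vl = softmax(vision @ language.T)      # vision attends to language
    lang_out = w_lv @ vision                 # (L_tokens, D)
    vis_out = w_vl @ language                # (V_tokens, D)
    return lang_out, vis_out

rng = np.random.default_rng(0)
vision = rng.standard_normal((5, 8))    # 5 visual tokens, feature dim 8
language = rng.standard_normal((3, 8))  # 3 language tokens, feature dim 8
lang_out, vis_out = mutual_attention(vision, language)
print(lang_out.shape, vis_out.shape)    # (3, 8) (5, 8)
```

In the generic form, any downstream mask decoder sees only re-weighted vision features; the mutual form keeps a language-conditioned and a vision-conditioned stream, which is the fusion gap the paper targets.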
Pages: 3054-3065
Page count: 12
Related Papers
(56 total)
[41]   Toward Achieving Robust Low-Level and High-Level Scene Parsing [J].
Shuai, Bing ;
Ding, Henghui ;
Liu, Ting ;
Wang, Gang ;
Jiang, Xudong .
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2019, 28 (03) :1378-1390
[42]  
Vaswani A, 2017, Attention Is All You Need, ADV NEUR IN (NeurIPS), V30
[43]   Neighbourhood Watch: Referring Expression Comprehension via Language-guided Graph Attention Networks [J].
Wang, Peng ;
Wu, Qi ;
Cao, Jiewei ;
Shen, Chunhua ;
Gao, Lianli ;
van den Hengel, Anton .
2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, :1960-1968
[44]   Non-local Neural Networks [J].
Wang, Xiaolong ;
Girshick, Ross ;
Gupta, Abhinav ;
He, Kaiming .
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, :7794-7803
[45]  
Wang Z., 2022, CVPR, P11686
[46]  
Xie EZ, 2021, SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers, ADV NEUR IN (NeurIPS), V34
[47]   Bottom-Up Shift and Reasoning for Referring Image Segmentation [J].
Yang, Sibei ;
Xia, Meng ;
Li, Guanbin ;
Zhou, Hong-Yu ;
Yu, Yizhou .
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, :11261-11270
[48]   LAVT: Language-Aware Vision Transformer for Referring Image Segmentation [J].
Yang, Zhao ;
Wang, Jiaqi ;
Tang, Yansong ;
Chen, Kai ;
Zhao, Hengshuang ;
Torr, Philip H. S. .
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, :18134-18144
[49]   Improving One-Stage Visual Grounding by Recursive Sub-query Construction [J].
Yang, Zhengyuan ;
Chen, Tianlang ;
Wang, Liwei ;
Luo, Jiebo .
COMPUTER VISION - ECCV 2020, PT XIV, 2020, 12359 :387-404
[50]   A Fast and Accurate One-Stage Approach to Visual Grounding [J].
Yang, Zhengyuan ;
Gong, Boqing ;
Wang, Liwei ;
Huang, Wenbing ;
Yu, Dong ;
Luo, Jiebo .
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, :4682-4692