Contrastive Grouping with Transformer for Referring Image Segmentation

被引:34
作者
Tang, Jiajin [1 ]
Zheng, Ge [1 ]
Shi, Cheng [1 ]
Yang, Sibei [1 ,2 ]
机构
[1] ShanghaiTech Univ, Sch Informat Sci & Technol, Shanghai, Peoples R China
[2] Shanghai Engn Res Ctr Intelligent Vis & Imaging, Shanghai, Peoples R China
来源
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023年
基金
中国国家自然科学基金;
关键词
D O I
10.1109/CVPR52729.2023.02257
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Referring image segmentation aims to segment the target referent in an image conditioning on a natural language expression. Existing one-stage methods employ per-pixel classification frameworks, which attempt straightforwardly to align vision and language at the pixel level, thus failing to capture critical object-level information. In this paper, we propose a mask classification framework, Contrastive Grouping with Transformer network (CGFormer), which explicitly captures object-level information via token-based querying and grouping strategy. Specifically, CGFormer first introduces learnable query tokens to represent objects and then alternately queries linguistic features and groups visual features into the query tokens for object-aware cross-modal reasoning. In addition, CGFormer achieves crosslevel interaction by jointly updating the query tokens and decoding masks in every two consecutive layers. Finally, CGFormer cooperates contrastive learning to the grouping strategy to identify the token and its mask corresponding to the referent. Experimental results demonstrate that CGFormer outperforms state-of-the-art methods in both segmentation and generalization settings consistently and significantly. Code is available at https://github.com/Toneyaya/CGFormer.
引用
收藏
页码:23570 / 23580
页数:11
相关论文
共 69 条
[11]  
Deng J, 2009, PROC CVPR IEEE, P248, DOI 10.1109/CVPRW.2009.5206848
[12]  
Devlin J, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P4171
[13]   Vision-Language Transformer and Query Generation for Referring Segmentation [J].
Ding, Henghui ;
Liu, Chang ;
Wang, Suchen ;
Jiang, Xudong .
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, :16301-16310
[14]   Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [J].
Feng, Guang ;
Hu, Zhiwei ;
Zhang, Lihe ;
Lu, Huchuan .
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, :15501-15510
[15]   Dual Attention Network for Scene Segmentation [J].
Fu, Jun ;
Liu, Jing ;
Tian, Haijie ;
Li, Yong ;
Bao, Yongjun ;
Fang, Zhiwei ;
Lu, Hanqing .
2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, :3141-3149
[16]  
He KM, 2020, IEEE T PATTERN ANAL, V42, P386, DOI [10.1109/TPAMI.2018.2844175, 10.1109/ICCV.2017.322]
[17]   Segmentation from Natural Language Expressions [J].
Hu, Ronghang ;
Rohrbach, Marcus ;
Darrell, Trevor .
COMPUTER VISION - ECCV 2016, PT I, 2016, 9905 :108-124
[18]   Bi-directional Relationship Inferring Network for Referring Image Segmentation [J].
Hu, Zhiwei ;
Feng, Guang ;
Sun, Jiayu ;
Zhang, Lihe ;
Lu, Huchuan .
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2020, :4423-4432
[19]   Referring Image Segmentation via Cross-Modal Progressive Comprehension [J].
Huang, Shaofei ;
Hui, Tianrui ;
Liu, Si ;
Li, Guanbin ;
Wei, Yunchao ;
Han, Jizhong ;
Liu, Luoqi ;
Li, Bo .
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, :10485-10494
[20]   Linguistic Structure Guided Context Modeling for Referring Image Segmentation [J].
Hui, Tianrui ;
Liu, Si ;
Huang, Shaofei ;
Li, Guanbin ;
Yu, Sansi ;
Zhang, Faxi ;
Han, Jizhong .
COMPUTER VISION - ECCV 2020, PT X, 2020, 12355 :59-75