Contrastive Grouping with Transformer for Referring Image Segmentation

被引:32
作者
Tang, Jiajin [1 ]
Zheng, Ge [1 ]
Shi, Cheng [1 ]
Yang, Sibei [1 ,2 ]
机构
[1] ShanghaiTech Univ, Sch Informat Sci & Technol, Shanghai, Peoples R China
[2] Shanghai Engn Res Ctr Intelligent Vis & Imaging, Shanghai, Peoples R China
来源
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023年
基金
中国国家自然科学基金;
关键词
D O I
10.1109/CVPR52729.2023.02257
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Referring image segmentation aims to segment the target referent in an image conditioning on a natural language expression. Existing one-stage methods employ per-pixel classification frameworks, which attempt straightforwardly to align vision and language at the pixel level, thus failing to capture critical object-level information. In this paper, we propose a mask classification framework, Contrastive Grouping with Transformer network (CGFormer), which explicitly captures object-level information via token-based querying and grouping strategy. Specifically, CGFormer first introduces learnable query tokens to represent objects and then alternately queries linguistic features and groups visual features into the query tokens for object-aware cross-modal reasoning. In addition, CGFormer achieves crosslevel interaction by jointly updating the query tokens and decoding masks in every two consecutive layers. Finally, CGFormer cooperates contrastive learning to the grouping strategy to identify the token and its mask corresponding to the referent. Experimental results demonstrate that CGFormer outperforms state-of-the-art methods in both segmentation and generalization settings consistently and significantly. Code is available at https://github.com/Toneyaya/CGFormer.
引用
收藏
页码:23570 / 23580
页数:11
相关论文
共 69 条
[1]  
[Anonymous], 2017, 5 INT C LEARNING REP
[2]  
Carion N, 2020, Img Proc Comp Vis Re, V12346, P213, DOI 10.1007/978-3-030-58452-8_13
[3]   See-Through-Text Grouping for Referring Image Segmentation [J].
Chen, Ding-Jie ;
Jia, Songhao ;
Lo, Yi-Chen ;
Chen, Hwann-Tzong ;
Liu, Tyng-Luh .
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, :7453-7462
[4]   Language-Based Image Editing with Recurrent Attentive Models [J].
Chen, Jianbo ;
Shen, Yelong ;
Gao, Jianfeng ;
Liu, Jingjing ;
Liu, Xiaodong .
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, :8721-8729
[5]   DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs [J].
Chen, Liang-Chieh ;
Papandreou, George ;
Kokkinos, Iasonas ;
Murphy, Kevin ;
Yuille, Alan L. .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2018, 40 (04) :834-848
[6]  
Chen LB, 2017, IEEE INT SYMP NANO, P1, DOI 10.1109/NANOARCH.2017.8053709
[7]  
Chen Y.-W., 2019, BMVC
[8]   Dynamic Convolution: Attention over Convolution Kernels [J].
Chen, Yinpeng ;
Dai, Xiyang ;
Liu, Mengchen ;
Chen, Dongdong ;
Yuan, Lu ;
Liu, Zicheng .
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, :11027-11036
[9]  
Cheng B, 2021, ADV NEUR IN, V34
[10]  
Cheng Bowen, 2021, ARXIV211210764