Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs

Cited by: 39
Authors
Cha, Junbum [1]
Mun, Jonghwan [1]
Roh, Byungseok [1]
Affiliations
[1] Kakao Brain, Seongnam, South Korea
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
DOI
10.1109/CVPR52729.2023.01074
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We tackle open-world semantic segmentation, which aims at learning to segment arbitrary visual concepts in images, by using only image-text pairs without dense annotations. Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts and transferring the learned image-level understanding to the segmentation task. However, these CL-based methods suffer from a train-test discrepancy, since they only consider image-text alignment during training, whereas segmentation requires region-text alignment during testing. In this paper, we propose a novel Text-grounded Contrastive Learning (TCL) framework that enables a model to directly learn region-text alignment. Our method generates a segmentation mask for a given text, extracts a text-grounded image embedding from the masked region, and aligns it with the text embedding via TCL. By learning region-text alignment directly, our framework encourages a model to improve the quality of the generated segmentation masks. In addition, for a rigorous and fair comparison, we present a unified evaluation protocol with eight widely used semantic segmentation datasets. TCL achieves state-of-the-art zero-shot segmentation performance by large margins on all datasets. Code is available at https://github.com/kakaobrain/tcl.
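The three steps named in the abstract (generate a mask for a text, pool an image embedding from the masked region, align it with the text embedding contrastively) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, which is at the repository linked above. The function names, the sigmoid soft mask, the `scale` and `temperature` values, and the symmetric InfoNCE loss are all assumptions chosen for illustration, and the patch features and text embeddings are assumed to come from some pretrained image and text encoders.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def text_grounded_masks(patch_feats, text_embs, scale=10.0):
    """Soft masks from patch-text cosine similarity (illustrative assumption).

    patch_feats: (B, N, D) patch features for B images with N patches each.
    text_embs:   (B, D) one text embedding per image.
    Returns masks of shape (B, B, N); masks[i, j] is the mask on image i for text j.
    """
    p = l2_normalize(patch_feats)
    t = l2_normalize(text_embs)
    sim = np.einsum("ind,jd->ijn", p, t)
    return 1.0 / (1.0 + np.exp(-scale * sim))  # sigmoid keeps values in (0, 1)


def grounded_embeddings(patch_feats, masks):
    """Mask-weighted average of patch features, giving shape (B, B, D)."""
    weights = masks / (masks.sum(axis=-1, keepdims=True) + 1e-8)
    return np.einsum("ijn,ind->ijd", weights, patch_feats)


def region_text_contrastive_loss(patch_feats, text_embs, temperature=0.07):
    """Symmetric InfoNCE between text-grounded image embeddings and texts."""
    masks = text_grounded_masks(patch_feats, text_embs)
    g = l2_normalize(grounded_embeddings(patch_feats, masks))
    t = l2_normalize(text_embs)
    logits = np.einsum("ijd,jd->ij", g, t) / temperature  # (B, B)

    def cross_entropy_on_diagonal(l):
        # Matched image-text pairs sit on the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(4, 16, 8))  # 4 images, 16 patches, 8-dim features
    texts = rng.normal(size=(4, 8))        # 4 paired text embeddings
    print(region_text_contrastive_loss(patches, texts))
```

Because the loss is computed on embeddings pooled through the mask, its gradient in a differentiable framework would flow back into the mask itself, which is the sense in which region-text alignment is learned directly rather than inferred from image-level alignment.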
Pages: 11165-11174
Number of pages: 10