Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs

Cited by: 39

Authors:

Cha, Junbum [1]

Mun, Jonghwan [1]

Roh, Byungseok [1]

Affiliations:

[1] Kakao Brain, Seongnam, South Korea

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023

DOI:

10.1109/CVPR52729.2023.01074

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

We tackle open-world semantic segmentation, which aims at learning to segment arbitrary visual concepts in images, by using only image-text pairs without dense annotations. Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts and transferring the learned image-level understanding to the segmentation task. However, these CL-based methods suffer from a train-test discrepancy, since they only consider image-text alignment during training, whereas segmentation requires region-text alignment during testing. In this paper, we propose a novel Text-grounded Contrastive Learning (TCL) framework that enables a model to directly learn region-text alignment. Our method generates a segmentation mask for a given text, extracts a text-grounded image embedding from the masked region, and aligns it with the text embedding via TCL. By learning region-text alignment directly, our framework encourages a model to directly improve the quality of generated segmentation masks. In addition, for a rigorous and fair comparison, we present a unified evaluation protocol with eight widely used semantic segmentation datasets. TCL achieves state-of-the-art zero-shot segmentation performance by large margins on all datasets. Code is available at https://github.com/kakaobrain/tcl.
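
The three steps named in the abstract (generate a mask for a text, embed the masked region, align that embedding with the text) can be illustrated with a toy sketch. This is not the authors' implementation, which is at the repository linked above. It is a minimal pure-Python illustration under assumptions that do not come from the record: patch and text embeddings are given as plain lists, the mask is a sigmoid of scaled cosine similarity, the region embedding is a mask-weighted average, and alignment uses a symmetric InfoNCE-style loss. All function names, the `scale` value, and the `temperature` value are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v)) or 1.0
    return [x / n for x in v]

def text_grounded_mask(patches, text, scale=10.0):
    """Soft mask per patch: sigmoid of scaled cosine similarity to the text.
    (Assumed form; the paper's mask generator may differ.)"""
    t = normalize(text)
    return [1.0 / (1.0 + math.exp(-scale * dot(normalize(p), t))) for p in patches]

def grounded_embedding(patches, mask):
    """Mask-weighted average of patch embeddings, i.e. the embedding
    of the (softly) masked region."""
    total = sum(mask)  # always > 0 because sigmoid outputs are positive
    dim = len(patches[0])
    return [sum(m * p[d] for m, p in zip(mask, patches)) / total for d in range(dim)]

def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def region_text_contrastive_loss(batch_patches, batch_texts, temperature=0.1):
    """Symmetric InfoNCE-style loss between text-grounded image embeddings
    and text embeddings; pair (i, i) is treated as the positive."""
    n = len(batch_texts)
    sim = [[0.0] * n for _ in range(n)]
    for i, patches in enumerate(batch_patches):
        for j, text in enumerate(batch_texts):
            mask = text_grounded_mask(patches, text)
            emb = normalize(grounded_embedding(patches, mask))
            sim[i][j] = dot(emb, normalize(text)) / temperature
    image_to_text = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)

# Toy data: two "images" of three patches each, two "texts".
image_a = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]  # concept A + background
image_b = [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]]  # concept B + background
text_a, text_b = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]

mask_a = text_grounded_mask(image_a, text_a)
aligned_loss = region_text_contrastive_loss([image_a, image_b], [text_a, text_b])
swapped_loss = region_text_contrastive_loss([image_a, image_b], [text_b, text_a])
```

In this toy setup the mask weights the concept patches above the background patch, and the loss is lower when each image is paired with its matching text than when the texts are swapped, which is the region-text alignment signal the abstract describes.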


Pages: 11165-11174

Number of pages: 10
