Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs

Cited by: 39
Authors
Cha, Junbum [1 ]
Mun, Jonghwan [1 ]
Roh, Byungseok [1 ]
Affiliations
[1] Kakao Brain, Seongnam, South Korea
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
DOI
10.1109/CVPR52729.2023.01074
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We tackle open-world semantic segmentation, which aims to segment arbitrary visual concepts in images using only image-text pairs, without dense annotations. Existing open-world segmentation methods have made impressive advances by employing contrastive learning (CL) to learn diverse visual concepts and then transferring the learned image-level understanding to the segmentation task. However, these CL-based methods suffer from a train-test discrepancy: training considers only image-text alignment, whereas segmentation at test time requires region-text alignment. In this paper, we propose a novel Text-grounded Contrastive Learning (TCL) framework that enables a model to learn region-text alignment directly. Our method generates a segmentation mask for a given text, extracts a text-grounded image embedding from the masked region, and aligns it with the text embedding via TCL. By learning region-text alignment directly, our framework encourages the model to improve the quality of the generated segmentation masks. In addition, for a rigorous and fair comparison, we present a unified evaluation protocol over 8 widely used semantic segmentation datasets. TCL achieves state-of-the-art zero-shot segmentation performance by large margins on all datasets. Code is available at https://github.com/kakaobrain/tcl.
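The abstract's pipeline (text-conditioned mask generation, masked pooling into a text-grounded image embedding, and contrastive alignment with the text embedding) can be sketched in a few lines. The following NumPy sketch is illustrative only: the function name, shapes, the shared temperature, and the sigmoid masking are assumptions for exposition, not the authors' implementation (see the linked repository for the real one).

```python
import numpy as np

def tcl_step(pixel_feats, text_embs, temperature=0.07):
    """Illustrative sketch of one text-grounded contrastive step.

    pixel_feats: (B, H, W, D) per-pixel image embeddings, L2-normalized.
    text_embs:   (B, D) paired text embeddings, L2-normalized.
    Returns (masks, grounded_embs, loss).
    """
    B = pixel_feats.shape[0]
    # Per-pixel similarity to the paired text -> soft mask via sigmoid.
    sims = np.einsum('bhwd,bd->bhw', pixel_feats, text_embs)
    masks = 1.0 / (1.0 + np.exp(-sims / temperature))
    # Masked average pooling -> text-grounded image embedding.
    w = masks[..., None]
    grounded = (pixel_feats * w).sum(axis=(1, 2)) / (w.sum(axis=(1, 2)) + 1e-8)
    grounded /= np.linalg.norm(grounded, axis=-1, keepdims=True) + 1e-8
    # Symmetric InfoNCE between grounded image embeddings and text embeddings:
    # matched pairs (the diagonal) are pulled together, mismatched pairs apart,
    # so better masks directly yield a lower loss.
    logits = grounded @ text_embs.T / temperature
    idx = np.arange(B)

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    loss = 0.5 * (xent(logits) + xent(logits.T))
    return masks, grounded, loss
```

Because the mask itself sits inside the contrastive objective, gradients flow through the mask generator, which is the mechanism by which region-text alignment is learned without dense labels.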
Pages: 11165-11174
Page count: 10