Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models

Cited by: 160

Authors:

Xu, Jiarui [1]

Liu, Sifei [2]

Vahdat, Arash [2]

Byeon, Wonmin [2]

Wang, Xiaolong [1]

De Mello, Shalini [2]

Affiliations:

[1] Univ Calif San Diego, La Jolla, CA 92093 USA

[2] NVIDIA, Santa Clara, CA USA

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023

Keywords:

DOI:

10.1109/CVPR52729.2023.00289

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation. Text-to-image diffusion models have the remarkable ability to generate high-quality images with diverse open-vocabulary language descriptions. This demonstrates that their internal representation space is highly correlated with open concepts in the real world. Text-image discriminative models like CLIP, on the other hand, are good at classifying images into open-vocabulary labels. We leverage the frozen internal representations of both these models to perform panoptic segmentation of any category in the wild. Our approach outperforms the previous state of the art by significant margins on both open-vocabulary panoptic and semantic segmentation tasks. In particular, with COCO training only, our method achieves 23.4 PQ and 30.0 mIoU on the ADE20K dataset, with 8.3 PQ and 7.9 mIoU absolute improvement over the previous state of the art. We open-source our code and models at https://github.com/NVlabs/ODISE.


Pages: 2955-2966

Page count: 12

