Generalized Decoding for Pixel, Image, and Language

Times cited: 66

Authors:

Zou, Xueyan ^{[1]}

Dou, Zi-Yi ^{[2]}

Yang, Jianwei ^{[3]}

Gan, Zhe ^{[4]}

Li, Linjie ^{[4]}

Li, Chunyuan ^{[3]}

Dai, Xiyang ^{[4]}

Behl, Harkirat ^{[3]}

Wang, Jianfeng ^{[4]}

Yuan, Lu ^{[4]}

Peng, Nanyun ^{[2]}

Wang, Lijuan ^{[4]}

Lee, Yong Jae ^{[1]}

Gao, Jianfeng ^{[3]}

Affiliations:

[1] Univ Wisconsin Madison, Madison, WI 53706 USA

[2] UCLA, Los Angeles, CA 90024 USA

[3] Microsoft Res Redmond, Redmond, WA USA

[4] Microsoft Cloud & AI, Redmond, WA USA

Source:

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) | 2023

DOI:

10.1109/CVPR52729.2023.01451

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Without any pseudo-labeling, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level understanding. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on seven datasets; (2) better or competitive finetuned performance to other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing shown in Fig. 1). Code, demo, video and visualization are available at: https://x-decodervl.github.io.

Pages: 15116-15127

Page count: 12
