Generalized Decoding for Pixel, Image, and Language

Cited by: 66
Authors
Zou, Xueyan [1 ]
Dou, Zi-Yi [2 ]
Yang, Jianwei [3 ]
Gan, Zhe [4 ]
Li, Linjie [4 ]
Li, Chunyuan [3 ]
Dai, Xiyang [4 ]
Behl, Harkirat [3 ]
Wang, Jianfeng [4 ]
Yuan, Lu [4 ]
Peng, Nanyun [2 ]
Wang, Lijuan [4 ]
Lee, Yong Jae [1 ]
Gao, Jianfeng [3 ]
Affiliations
[1] Univ Wisconsin Madison, Madison, WI 53706 USA
[2] UCLA, Los Angeles, CA 90024 USA
[3] Microsoft Res Redmond, Redmond, WA USA
[4] Microsoft Cloud & AI, Redmond, WA USA
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
DOI
10.1109/CVPR52729.2023.01451
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Without any pseudo-labeling, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level understanding. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on seven datasets; (2) finetuned performance better than or competitive with other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing shown in Fig. 1). Code, demo, video, and visualization are available at: https://x-decodervl.github.io.
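The abstract's core idea, decoding latent (non-semantic) queries and text-induced (semantic) queries side by side into outputs that live in one shared semantic space, can be illustrated with a minimal sketch. This is not the paper's implementation; the class name, dimensions, and the crude attention-style pooling below are all illustrative assumptions.

```python
# Conceptual sketch only (assumption: names/shapes are illustrative,
# not X-Decoder's actual architecture or code).
import random

random.seed(0)
DIM = 8  # shared semantic embedding dimension (illustrative)

def rand_vec(dim=DIM):
    return [random.random() for _ in range(dim)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class GeneralizedDecoder:
    """Decodes two query types against image features:
    (i) learned latent queries -> pixel-level mask embeddings,
    (ii) text-induced queries  -> token-level embeddings,
    with both output sets in the same semantic space."""

    def __init__(self, num_latent_queries=4):
        # (i) generic non-semantic queries, learned in the real model
        self.latent_queries = [rand_vec() for _ in range(num_latent_queries)]

    def decode(self, text_queries, image_features):
        # (ii) semantic queries induced from text inputs are processed
        # jointly with the latent queries
        queries = self.latent_queries + text_queries
        outputs = []
        for q in queries:
            # each query pools image features by similarity
            # (a stand-in for cross-attention)
            weights = [dot(q, f) for f in image_features]
            total = sum(weights) or 1.0
            out = [sum(w * f[i] for w, f in zip(weights, image_features)) / total
                   for i in range(DIM)]
            outputs.append(out)
        # pixel-level outputs come from latent queries,
        # token-level outputs from text queries
        n = len(self.latent_queries)
        return {"mask_embeddings": outputs[:n], "token_embeddings": outputs[n:]}

decoder = GeneralizedDecoder()
image_feats = [rand_vec() for _ in range(16)]
text_qs = [rand_vec() for _ in range(3)]
result = decoder.decode(text_qs, image_feats)
print(len(result["mask_embeddings"]), len(result["token_embeddings"]))  # 4 3
```

Because both output sets are vectors in the same space, mask embeddings can be matched against text embeddings directly, which is what makes open-vocabulary segmentation and task composition possible in the paper's design.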
Pages: 15116-15127
Page count: 12