Generalized Decoding for Pixel, Image, and Language

Cited by: 66
Authors
Zou, Xueyan [1 ]
Dou, Zi-Yi [2 ]
Yang, Jianwei [3 ]
Gan, Zhe [4 ]
Li, Linjie [4 ]
Li, Chunyuan [3 ]
Dai, Xiyang [4 ]
Behl, Harkirat [3 ]
Wang, Jianfeng [4 ]
Yuan, Lu [4 ]
Peng, Nanyun [2 ]
Wang, Lijuan [4 ]
Lee, Yong Jae [1 ]
Gao, Jianfeng [3 ]
Affiliations
[1] Univ Wisconsin Madison, Madison, WI 53706 USA
[2] UCLA, Los Angeles, CA 90024 USA
[3] Microsoft Res Redmond, Redmond, WA USA
[4] Microsoft Cloud & AI, Redmond, WA USA
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
DOI
10.1109/CVPR52729.2023.01451
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Without any pseudo-labeling, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level understanding. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on seven datasets; (2) finetuned performance better than or competitive with other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing shown in Fig. 1). Code, demo, video, and visualization are available at: https://x-decodervl.github.io.
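The abstract's core idea, decoding latent (non-semantic) queries and text-induced (semantic) queries side by side into outputs that live in one shared semantic space, can be illustrated with a minimal sketch. This is not the paper's implementation; the class name, dimensions, and the crude attention-style pooling below are all illustrative assumptions.

```python
# Conceptual sketch only (assumption: names/shapes are illustrative,
# not X-Decoder's actual architecture or code).
import random

random.seed(0)
DIM = 8  # shared semantic embedding dimension (illustrative)

def rand_vec(dim=DIM):
    return [random.random() for _ in range(dim)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class GeneralizedDecoder:
    """Decodes two query types against image features:
    (i) learned latent queries -> pixel-level mask embeddings,
    (ii) text-induced queries  -> token-level embeddings,
    with both output sets in the same semantic space."""

    def __init__(self, num_latent_queries=4):
        # (i) generic non-semantic queries, learned in the real model
        self.latent_queries = [rand_vec() for _ in range(num_latent_queries)]

    def decode(self, text_queries, image_features):
        # (ii) semantic queries induced from text inputs are processed
        # jointly with the latent queries
        queries = self.latent_queries + text_queries
        outputs = []
        for q in queries:
            # each query pools image features by similarity
            # (a stand-in for cross-attention)
            weights = [dot(q, f) for f in image_features]
            total = sum(weights) or 1.0
            out = [sum(w * f[i] for w, f in zip(weights, image_features)) / total
                   for i in range(DIM)]
            outputs.append(out)
        # pixel-level outputs come from latent queries,
        # token-level outputs from text queries
        n = len(self.latent_queries)
        return {"mask_embeddings": outputs[:n], "token_embeddings": outputs[n:]}

decoder = GeneralizedDecoder()
image_feats = [rand_vec() for _ in range(16)]
text_qs = [rand_vec() for _ in range(3)]
result = decoder.decode(text_qs, image_feats)
print(len(result["mask_embeddings"]), len(result["token_embeddings"]))  # 4 3
```

Because both output sets are vectors in the same space, mask embeddings can be matched against text embeddings directly, which is what makes open-vocabulary segmentation and task composition possible in the paper's design.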
Pages: 15116-15127
Page count: 12