Generalized Decoding for Pixel, Image, and Language

Cited by: 66
Authors
Zou, Xueyan [1]
Dou, Zi-Yi [2]
Yang, Jianwei [3]
Gan, Zhe [4]
Li, Linjie [4]
Li, Chunyuan [3]
Dai, Xiyang [4]
Behl, Harkirat [3]
Wang, Jianfeng [4]
Yuan, Lu [4]
Peng, Nanyun [2]
Wang, Lijuan [4]
Lee, Yong Jae [1]
Gao, Jianfeng [3]
Affiliations
[1] Univ Wisconsin Madison, Madison, WI 53706 USA
[2] UCLA, Los Angeles, CA 90024 USA
[3] Microsoft Res Redmond, Redmond, WA USA
[4] Microsoft Cloud & AI, Redmond, WA USA
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
DOI
10.1109/CVPR52729.2023.01451
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Without any pseudo-labeling, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level understanding. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on seven datasets; (2) better or competitive finetuned performance to other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing shown in Fig. 1). Code, demo, video and visualization are available at: https://x-decodervl.github.io.
Pages: 15116-15127
Number of pages: 12