Images Speak in Images: A Generalist Painter for In-Context Visual Learning

Cited by: 53
Authors
Wang, Xinlong [1]
Wang, Wen [2]
Cao, Yue [1]
Shen, Chunhua [2]
Huang, Tiejun [1, 3]
Affiliations
[1] Beijing Acad Artificial Intelligence, Beijing, Peoples R China
[2] Zhejiang Univ, Hangzhou, Peoples R China
[3] Peking Univ, Beijing, Peoples R China
Funding
National Key Research and Development Program of China
DOI
10.1109/CVPR52729.2023.00660
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In-context learning, a new paradigm in NLP, allows a model to rapidly adapt to various tasks with only a handful of prompts and examples. In computer vision, however, in-context learning is difficult because tasks vary significantly in their output representations, so it is unclear how to define general-purpose task prompts that a vision model can understand and transfer to out-of-domain tasks. In this work, we present Painter, a generalist model that addresses these obstacles with an "image"-centric solution: redefine the outputs of core vision tasks as images, and specify task prompts as images as well. With this idea, the training process is extremely simple: it performs standard masked image modeling on stitched pairs of input and output images. This makes the model capable of performing tasks conditioned on visible image patches. During inference, we can therefore use a pair of input and output images from the same task as the input condition to indicate which task to perform. Without bells and whistles, the generalist Painter achieves competitive performance compared with well-established task-specific models on seven representative vision tasks, ranging from high-level visual understanding to low-level image processing. In addition, Painter significantly outperforms recent generalist models on several challenging tasks.
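The stitching-and-masking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the 2x2 canvas layout, the patch size, the `MASK` placeholder, and all function names below are assumptions made for illustration, and images are represented as plain nested lists rather than tensors.

```python
import random

MASK = None  # hypothetical placeholder for a masked pixel


def stitch_canvas(prompt_in, prompt_out, query_in, query_out):
    """Stitch four equally sized H x W images into one 2H x 2W canvas.

    Assumed layout: prompt input | prompt output on top,
    query input | query output on the bottom.
    """
    top = [a + b for a, b in zip(prompt_in, prompt_out)]
    bottom = [a + b for a, b in zip(query_in, query_out)]
    return top + bottom


def mask_patches(canvas, patch, patches_to_mask):
    """Return a copy of the canvas with the listed (row, col) patches masked."""
    masked = [row[:] for row in canvas]
    for pr, pc in patches_to_mask:
        for r in range(pr * patch, (pr + 1) * patch):
            for c in range(pc * patch, (pc + 1) * patch):
                masked[r][c] = MASK
    return masked


def random_training_mask(canvas, patch, ratio, rng):
    """Pick a random subset of patches to hide (masked image modeling)."""
    rows, cols = len(canvas) // patch, len(canvas[0]) // patch
    all_patches = [(r, c) for r in range(rows) for c in range(cols)]
    return rng.sample(all_patches, int(len(all_patches) * ratio))


def inference_mask(canvas, patch):
    """Hide the whole bottom-right quadrant, where the query output belongs."""
    rows, cols = len(canvas) // patch, len(canvas[0]) // patch
    return [(r, c) for r in range(rows // 2, rows)
            for c in range(cols // 2, cols)]


def constant_image(h, w, value):
    """Toy stand-in for an image: an h x w grid filled with one value."""
    return [[value] * w for _ in range(h)]


H = W = 4
PATCH = 2
prompt_in = constant_image(H, W, 1)   # example input for the task
prompt_out = constant_image(H, W, 2)  # its output, rendered as an image
query_in = constant_image(H, W, 3)    # new input to process
unknown = constant_image(H, W, 0)     # query output is not known yet

canvas = stitch_canvas(prompt_in, prompt_out, query_in, unknown)

# Training-style input: random patches hidden, model would reconstruct them.
train_patches = random_training_mask(canvas, PATCH, 0.75, random.Random(0))
train_input = mask_patches(canvas, PATCH, train_patches)

# Inference-style input: the prompt pair stays visible and selects the task.
inference_input = mask_patches(canvas, PATCH, inference_mask(canvas, PATCH))
```

In this sketch the visible prompt pair plays the role of the task specification, and a model trained to fill in masked patches would be asked to paint the hidden quadrant; the model itself is omitted.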
Pages: 6830-6839
Page count: 10