SINC: Self-Supervised In-Context Learning for Vision-Language Tasks

Cited by: 0
Authors
Chen, Yi-Syuan [1 ]
Song, Yun-Zhu [1 ]
Yeo, Cheng Yu [1 ]
Liu, Bei [2 ]
Fu, Jianlong [2 ]
Shuai, Hong-Han [1 ]
Affiliations
[1] Natl Yang Ming Chiao Tung Univ, Hsinchu, Taiwan
[2] Microsoft Res Asia, Beijing, Peoples R China
Keywords
DOI
10.1109/ICCV51070.2023.01415
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large pre-trained Transformers exhibit an intriguing capacity for in-context learning. Without gradient updates, these models can rapidly construct new predictors from demonstrations presented in the inputs. Recent works promote this ability in the vision-language domain by incorporating visual information into large language models that can already make in-context predictions. However, these methods can inherit issues from the language domain, such as template sensitivity and hallucination. Moreover, the scale of these language models imposes significant computational demands, making them resource-intensive to train and operate. To this end, we raise a question: "How can we enable in-context learning without relying on the intrinsic in-context ability of large language models?". To answer it, we propose a succinct and general framework, Self-supervised IN-Context learning (SINC), which introduces a meta-model that learns from self-supervised prompts consisting of tailored demonstrations. The learned models can be transferred to downstream tasks to make in-context predictions on the fly. Extensive experiments show that SINC outperforms gradient-based methods on various vision-language tasks under few-shot settings. Furthermore, the designs of SINC help us investigate the benefits of in-context learning across different tasks, and the analysis further reveals the essential components for the emergence of in-context learning in the vision-language domain.
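For readers unfamiliar with the mechanism the abstract refers to, the sketch below illustrates the generic idea of in-context prediction: a prompt is assembled from a few demonstration input-output pairs plus a query, and the model is expected to infer the task from the demonstrations alone, with no gradient updates. This is an illustrative sketch only; the names (`Demonstration`, `build_prompt`) are hypothetical and do not come from the SINC paper, which builds such prompts in a self-supervised way for vision-language inputs.

```python
# Illustrative sketch of in-context prompt construction (not the SINC method).
# A few demonstrations are concatenated with a query; the predictor is
# "constructed" purely from the prompt, without any parameter updates.

from dataclasses import dataclass
from typing import List


@dataclass
class Demonstration:
    """One input-output pair shown to the model inside the prompt."""
    question: str
    answer: str


def build_prompt(demos: List[Demonstration], query: str) -> str:
    """Concatenate demonstrations and the query into a single prompt string."""
    blocks = [f"Q: {d.question}\nA: {d.answer}" for d in demos]
    blocks.append(f"Q: {query}\nA:")  # model completes the final answer
    return "\n\n".join(blocks)


demos = [
    Demonstration("What color is the sky?", "blue"),
    Demonstration("What color is grass?", "green"),
]
prompt = build_prompt(demos, "What color is snow?")
print(prompt)
```

In a vision-language setting, each demonstration would additionally carry image features interleaved with the text; SINC's contribution is training a meta-model on self-supervised versions of such prompts rather than relying on a large language model's built-in in-context ability.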
Pages: 15384-15396
Page count: 13
Related Papers
(50 records in total)
  • [21] Self-Supervised Learning for Semi-Supervised Temporal Language Grounding
    Luo, Fan
    Chen, Shaoxiang
    Chen, Jingjing
    Wu, Zuxuan
    Jiang, Yu-Gang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 7747 - 7757
  • [22] Deciphering the language of antibodies using self-supervised learning
    Leem, Jinwoo
    Mitchell, Laura S.
    Farmery, James H. R.
    Barton, Justin
    Galson, Jacob D.
    PATTERNS, 2022, 3 (07)
  • [23] data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
    Baevski, Alexei
    Hsu, Wei-Ning
    Xu, Qiantong
    Babu, Arun
    Gu, Jiatao
    Auli, Michael
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [24] Self-Supervised 3-D Semantic Representation Learning for Vision-and-Language Navigation
    Tan, Sinan
    Sima, Kuankuan
    Wang, Dunzheng
    Ge, Mengmeng
    Guo, Di
    Liu, Huaping
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, : 1 - 14
  • [25] Siamese Image Modeling for Self-Supervised Vision Representation Learning
    Tao, Chenxin
    Zhu, Xizhou
    Su, Weijie
    Huang, Gao
    Li, Bin
    Zhou, Jie
    Qiao, Yu
    Wang, Xiaogang
    Dai, Jifeng
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 2132 - 2141
  • [26] Jointly Optimal Incremental Learning with Self-Supervised Vision Transformers
    Witzgall, Hanna
    2024 IEEE AEROSPACE CONFERENCE, 2024,
  • [27] Dissecting self-supervised learning methods for surgical computer vision
    Ramesh, Sanat
    Srivastav, Vinkle
    Alapatt, Deepak
    Yu, Tong
    Murali, Aditya
    Sestini, Luca
    Nwoye, Chinedu Innocent
    Hamoud, Idris
    Sharma, Saurav
    Fleurentin, Antoine
    Exarchakis, Georgios
    Karargyris, Alexandros
    Padoy, Nicolas
    MEDICAL IMAGE ANALYSIS, 2023, 88
  • [28] Language Features Matter: Effective Language Representations for Vision-Language Tasks
    Burns, Andrea
    Tan, Reuben
    Saenko, Kate
    Sclaroff, Stan
    Plummer, Bryan A.
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 7473 - 7482
  • [29] IN-CONTEXT LANGUAGE LEARNING: ARCHITECTURES AND ALGORITHMS
    Akyürek, Ekin
    Wang, Bailin
    Kim, Yoon
    Andreas, Jacob
    arXiv preprint
  • [30] Weakly Supervised Grounding for VQA in Vision-Language Transformers
    Khan, Aisha Urooj
    Kuehne, Hilde
    Gan, Chuang
    Lobo, Niels Da Vitoria
    Shah, Mubarak
    COMPUTER VISION - ECCV 2022, PT XXXV, 2022, 13695 : 652 - 670