SINC: Self-Supervised In-Context Learning for Vision-Language Tasks

Cited by: 0
Authors
Chen, Yi-Syuan [1 ]
Song, Yun-Zhu [1 ]
Yeo, Cheng Yu [1 ]
Liu, Bei [2 ]
Fu, Jianlong [2 ]
Shuai, Hong-Han [1 ]
Affiliations
[1] Natl Yang Ming Chiao Tung Univ, Hsinchu, Taiwan
[2] Microsoft Res Asia, Beijing, Peoples R China
Keywords
DOI
10.1109/ICCV51070.2023.01415
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large pre-trained Transformers exhibit an intriguing capacity for in-context learning. Without gradient updates, these models can rapidly construct new predictors from demonstrations presented in the inputs. Recent works promote this ability in the vision-language domain by incorporating visual information into large language models that can already make in-context predictions. However, these methods can inherit issues from the language domain, such as template sensitivity and hallucination. Moreover, the scale of these language models imposes significant computational demands, making them resource-intensive to train and operate. To this end, we raise a question: "How can we enable in-context learning without relying on the intrinsic in-context ability of large language models?". To answer it, we propose a succinct and general framework, Self-supervised IN-Context learning (SINC), which introduces a meta-model that learns from self-supervised prompts consisting of tailored demonstrations. The learned models can be transferred to downstream tasks to make in-context predictions on the fly. Extensive experiments show that SINC outperforms gradient-based methods on various vision-language tasks under few-shot settings. Furthermore, the designs of SINC help us investigate the benefits of in-context learning across different tasks, and the analysis further reveals the essential components for the emergence of in-context learning in the vision-language domain.
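For readers unfamiliar with the mechanism the abstract refers to, the sketch below illustrates the generic idea of in-context prediction: a prompt is assembled from a few demonstration input-output pairs plus a query, and the model is expected to infer the task from the demonstrations alone, with no gradient updates. This is an illustrative sketch only; the names (`Demonstration`, `build_prompt`) are hypothetical and do not come from the SINC paper, which builds such prompts in a self-supervised way for vision-language inputs.

```python
# Illustrative sketch of in-context prompt construction (not the SINC method).
# A few demonstrations are concatenated with a query; the predictor is
# "constructed" purely from the prompt, without any parameter updates.

from dataclasses import dataclass
from typing import List


@dataclass
class Demonstration:
    """One input-output pair shown to the model inside the prompt."""
    question: str
    answer: str


def build_prompt(demos: List[Demonstration], query: str) -> str:
    """Concatenate demonstrations and the query into a single prompt string."""
    blocks = [f"Q: {d.question}\nA: {d.answer}" for d in demos]
    blocks.append(f"Q: {query}\nA:")  # model completes the final answer
    return "\n\n".join(blocks)


demos = [
    Demonstration("What color is the sky?", "blue"),
    Demonstration("What color is grass?", "green"),
]
prompt = build_prompt(demos, "What color is snow?")
print(prompt)
```

In a vision-language setting, each demonstration would additionally carry image features interleaved with the text; SINC's contribution is training a meta-model on self-supervised versions of such prompts rather than relying on a large language model's built-in in-context ability.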
Pages: 15384-15396
Page count: 13
Related Papers
(50 records in total)
  • [21] Self-Supervised Learning for Semi-Supervised Temporal Language Grounding
    Luo, Fan
    Chen, Shaoxiang
    Chen, Jingjing
    Wu, Zuxuan
    Jiang, Yu-Gang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 7747 - 7757
  • [22] Deciphering the language of antibodies using self-supervised learning
    Leem, Jinwoo
    Mitchell, Laura S.
    Farmery, James H. R.
    Barton, Justin
    Galson, Jacob D.
    PATTERNS, 2022, 3 (07)
  • [23] data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
    Baevski, Alexei
    Hsu, Wei-Ning
    Xu, Qiantong
    Babu, Arun
    Gu, Jiatao
    Auli, Michael
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [24] Self-Supervised 3-D Semantic Representation Learning for Vision-and-Language Navigation
    Tan, Sinan
    Sima, Kuankuan
    Wang, Dunzheng
    Ge, Mengmeng
    Guo, Di
    Liu, Huaping
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, : 1 - 14
  • [25] Siamese Image Modeling for Self-Supervised Vision Representation Learning
    Tao, Chenxin
    Zhu, Xizhou
    Su, Weijie
    Huang, Gao
    Li, Bin
    Zhou, Jie
    Qiao, Yu
    Wang, Xiaogang
    Dai, Jifeng
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 2132 - 2141
  • [26] Jointly Optimal Incremental Learning with Self-Supervised Vision Transformers
    Witzgall, Hanna
    2024 IEEE AEROSPACE CONFERENCE, 2024,
  • [27] Dissecting self-supervised learning methods for surgical computer vision
    Ramesh, Sanat
    Srivastav, Vinkle
    Alapatt, Deepak
    Yu, Tong
    Murali, Aditya
    Sestini, Luca
    Nwoye, Chinedu Innocent
    Hamoud, Idris
    Sharma, Saurav
    Fleurentin, Antoine
    Exarchakis, Georgios
    Karargyris, Alexandros
    Padoy, Nicolas
    MEDICAL IMAGE ANALYSIS, 2023, 88
  • [28] Language Features Matter: Effective Language Representations for Vision-Language Tasks
    Burns, Andrea
    Tan, Reuben
    Saenko, Kate
    Sclaroff, Stan
    Plummer, Bryan A.
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 7473 - 7482
  • [29] IN-CONTEXT LANGUAGE LEARNING: ARCHITECTURES AND ALGORITHMS
    Akyürek, Ekin
    Wang, Bailin
    Kim, Yoon
    Andreas, Jacob
    arXiv preprint
  • [30] Weakly Supervised Grounding for VQA in Vision-Language Transformers
    Khan, Aisha Urooj
    Kuehne, Hilde
    Gan, Chuang
    Lobo, Niels Da Vitoria
    Shah, Mubarak
    COMPUTER VISION - ECCV 2022, PT XXXV, 2022, 13695 : 652 - 670