Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering

Cited: 2
Authors
Jiang, Jingjing [1 ]
Liu, Ziyi [1 ]
Zheng, Nanning [1 ]
Affiliations
[1] Xi'an Jiaotong Univ, Inst Artificial Intelligence & Robot, Xi'an 710049, Shaanxi, Peoples R China
Funding
U.S. National Science Foundation;
Keywords
Information bottleneck; Robustness; Visual question answering; Vision-language model; LANGUAGE;
DOI
10.1007/s11263-023-01858-y
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Benefiting from large-scale pretrained vision-language models (VLMs), the performance of visual question answering (VQA) has approached human oracles. However, finetuning such models on limited data often suffers from overfitting and poor generalization, leading to a lack of model robustness. In this paper, we aim to improve input robustness from an information bottleneck perspective when adapting pretrained VLMs to the downstream VQA task. Input robustness refers to the ability of models to defend against visual and linguistic input variations, as well as shortcut learning induced by the inputs. Generally, the representations obtained by pretrained VLMs inevitably contain information that is irrelevant and redundant for a specific downstream task, resulting in statistically spurious correlations and insensitivity to input variations. To encourage representations to converge to a minimal sufficient statistic in multimodal learning, we propose the Correlation Information Bottleneck (CIB), which seeks a tradeoff between compression and redundancy in representations by minimizing the mutual information (MI) between inputs and representations while maximizing the MI between outputs and representations. Moreover, we derive a tight theoretical upper bound on the MI between multimodal inputs and representations, incorporating different internal correlations that guide models to learn more robust representations and facilitate modality alignment. Extensive experiments consistently demonstrate the effectiveness and superiority of the proposed CIB in terms of input robustness and accuracy.
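The CIB objective described in the abstract instantiates the general information-bottleneck tradeoff: penalize the MI between inputs and representations while rewarding the MI between outputs and representations. The sketch below shows the generic variational IB loss that this family of objectives builds on, assuming a Gaussian bottleneck encoder; the function names, the `beta` weight, and the standard-normal prior are illustrative assumptions, not the paper's CIB bound, which additionally incorporates cross-modal correlation terms.

```python
import torch
import torch.nn.functional as F

def sample_z(mu, logvar):
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def vib_loss(mu, logvar, logits, targets, beta=1e-3):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ): variational upper bound on I(X; Z)
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
    # Cross-entropy: negated variational lower bound on I(Z; Y)
    ce = F.cross_entropy(logits, targets)
    # beta trades compression (small I(X; Z)) against prediction (large I(Z; Y))
    return ce + beta * kl

# Toy usage: 8 fused multimodal features, 16-dim bottleneck, 4 answer classes
mu, logvar = torch.randn(8, 16), torch.zeros(8, 16)
z = sample_z(mu, logvar)
logits = torch.nn.Linear(16, 4)(z)
loss = vib_loss(mu, logvar, logits, torch.randint(0, 4, (8,)))
loss.backward()
```

Minimizing this loss drives the representation toward a minimal sufficient statistic: the KL term compresses away input information, while the cross-entropy term preserves what is needed to predict the answer.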
Pages: 185 - 207
Page count: 23
Related Papers
50 records in total
  • [11] Information fusion in visual question answering: A Survey
    Zhang, Dongxiang
    Cao, Rui
    Wu, Sai
    INFORMATION FUSION, 2019, 52 : 268 - 280
  • [12] Fair Attention Network for Robust Visual Question Answering
    Bi, Yandong
    Jiang, Huajie
    Hu, Yongli
    Sun, Yanfeng
    Yin, Baocai
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (09) : 7870 - 7881
  • [13] Multimodal Encoders and Decoders with Gate Attention for Visual Question Answering
    Li, Haiyan
    Han, Dezhi
    COMPUTER SCIENCE AND INFORMATION SYSTEMS, 2021, 18 (03) : 1023 - 1040
  • [14] Multimodal fusion: advancing medical visual question-answering
    Mudgal, Anjali
    Kush, Udbhav
    Kumar, Aditya
    Jafari, Amir
NEURAL COMPUTING AND APPLICATIONS, 2024, 36 (33) : 20949 - 20962
  • [15] Multimodal Local Perception Bilinear Pooling for Visual Question Answering
    Lao, Mingrui
    Guo, Yanming
    Wang, Hui
    Zhang, Xin
    IEEE ACCESS, 2018, 6 : 57923 - 57932
  • [16] On the role of question encoder sequence model in robust visual question answering
    Kv, Gouthaman
    Mittal, Anurag
    PATTERN RECOGNITION, 2022, 131
  • [17] EduVQA: A multimodal Visual Question Answering framework for smart education
    Xiao, Jiongen
    Zhang, Zifeng
    ALEXANDRIA ENGINEERING JOURNAL, 2025, 122 : 615 - 624
  • [18] Multimodal attention-driven visual question answering for Malayalam
Kovath, A. G.
    Nayyar, A.
    Sikha, O. K.
    NEURAL COMPUTING AND APPLICATIONS, 2024, 36 (24) : 14691 - 14708
  • [19] Towards Building a Robust Knowledge Intensive Question Answering Model with Large Language Models
    Hong, Xingyun
    Shao, Yan
    Wang, Zhilin
    Duan, Manni
    Jin, Xiongnan
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT I, NLPCC 2024, 2025, 15359 : 228 - 242
  • [20] Dual-Key Multimodal Backdoors for Visual Question Answering
    Walmer, Matthew
    Sikka, Karan
    Sur, Indranil
    Shrivastava, Abhinav
    Jha, Susmit
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15354 - 15364