Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering

Cited: 2
Authors
Jiang, Jingjing [1 ]
Liu, Ziyi [1 ]
Zheng, Nanning [1 ]
Affiliations
[1] Xi An Jiao Tong Univ, Inst Artificial Intelligence & Robot, Xian 710049, Shaanxi, Peoples R China
Funding
National Science Foundation (USA);
Keywords
Information bottleneck; Robustness; Visual question answering; Vision-language model; LANGUAGE;
DOI
10.1007/s11263-023-01858-y
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Benefiting from large-scale pretrained vision-language models (VLMs), the performance of visual question answering (VQA) has approached that of human oracles. However, finetuning such models on limited data often suffers from overfitting and poor generalization, leading to a lack of model robustness. In this paper, we aim to improve input robustness from an information bottleneck perspective when adapting pretrained VLMs to the downstream VQA task. Input robustness refers to the ability of models to withstand visual and linguistic input variations, as well as shortcut learning induced by the inputs. In general, the representations obtained by pretrained VLMs inevitably contain information that is irrelevant and redundant for a specific downstream task, resulting in statistically spurious correlations and insensitivity to input variations. To encourage representations to converge to a minimal sufficient statistic in multimodal learning, we propose the Correlation Information Bottleneck (CIB), which seeks a tradeoff between compression and redundancy in representations by minimizing the mutual information (MI) between inputs and representations while maximizing the MI between outputs and representations. Moreover, we derive a tight theoretical upper bound on the MI between multimodal inputs and representations, incorporating different internal correlations that guide models to learn more robust representations and facilitate modality alignment. Extensive experiments consistently demonstrate the effectiveness and superiority of the proposed CIB in terms of input robustness and accuracy.
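The compression-prediction tradeoff described in the abstract is an instance of the classical information bottleneck objective. As a hedged sketch in standard notation that the record itself does not supply (X for the multimodal inputs, Z for the learned representation, Y for the answer outputs, and beta as a tradeoff coefficient), the objective minimized over the stochastic encoder is:

```latex
% Classical information bottleneck Lagrangian (standard notation, assumed
% here; the paper's specific multimodal upper bound on I(X;Z) is its own
% contribution and is not reproduced from this record).
\min_{p(z \mid x)} \; \mathcal{L}_{\mathrm{IB}}
  = I(X; Z) \;-\; \beta \, I(Z; Y)
```

Minimizing I(X; Z) compresses away input information (including shortcut features), while the -beta I(Z; Y) term preserves the information needed to predict the answer; CIB's stated contribution is a tighter upper bound on the first term that accounts for correlations between modalities.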
Pages: 185-207
Page count: 23
Related Papers
50 records
  • [21] Cycle-Consistency for Robust Visual Question Answering
    Shah, Meet
    Chen, Xinlei
    Rohrbach, Marcus
    Parikh, Devi
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 6642 - 6651
  • [22] Realizing Visual Question Answering for Education: GPT-4V as a Multimodal AI
    Lee, Gyeonggeon
    Zhai, Xiaoming
    TECHTRENDS, 2025, : 271 - 287
  • [23] Incorporation of question segregation procedures in visual question-answering models
    Chowdhury, Souvik
    Soni, Badal
    Phukan, Doli
    INTERNATIONAL JOURNAL OF COMPUTING SCIENCE AND MATHEMATICS, 2024, 20 (02) : 99 - 108
  • [24] R-VQA: A robust visual question answering model
    Chowdhury, Souvik
    Soni, Badal
    KNOWLEDGE-BASED SYSTEMS, 2025, 309
  • [25] Robust visual question answering via polarity enhancement and contrast
    Peng, Dahe
    Li, Zhixin
    NEURAL NETWORKS, 2024, 179
  • [26] Multimodal feature fusion by relational reasoning and attention for visual question answering
    Zhang, Weifeng
    Yu, Jing
    Hu, Hua
    Hu, Haiyang
    Qin, Zengchang
    INFORMATION FUSION, 2020, 55 : 116 - 126
  • [27] Multimodal Cross-guided Attention Networks for Visual Question Answering
    Liu, Haibin
    Gong, Shengrong
    Ji, Yi
    Yang, Jianyu
    Xing, Tengfei
    Liu, Chunping
    PROCEEDINGS OF THE 2018 INTERNATIONAL CONFERENCE ON COMPUTER MODELING, SIMULATION AND ALGORITHM (CMSA 2018), 2018, 151 : 347 - 353
  • [28] Multimodal Encoder-Decoder Attention Networks for Visual Question Answering
    Chen, Chongqing
    Han, Dezhi
    Wang, Jun
    IEEE ACCESS, 2020, 8 : 35662 - 35671
  • [29] Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering
    Chen, Long
    Zheng, Yuhang
    Niu, Yulei
    Zhang, Hanwang
    Xiao, Jun
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (11) : 13218 - 13234
  • [30] Towards Reasoning Ability in Scene Text Visual Question Answering
    Wang, Qingqing
    Xiao, Liqiang
    Lu, Yue
    Jin, Yaohui
    He, Hao
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 2281 - 2289