Robust data augmentation and contrast learning for debiased visual question answering

Cited: 0
Authors
Ning, Ke [1 ,2 ]
Li, Zhixin [1 ,2 ]
Affiliations
[1] Guangxi Normal Univ, Key Lab Educ Blockchain & Intelligent Technol, Minist Educ, Guilin 541004, Peoples R China
[2] Guangxi Normal Univ, Guangxi Key Lab Multisource Informat Min & Secur, Guilin 541004, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visual question answering; Language priors; Data augmentation; Knowledge distillation; Contrastive learning;
DOI
10.1016/j.neucom.2025.129527
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
The language prior problem in VQA leads a model to predict answers directly from spurious correlations between questions and answers, so its performance drops sharply on out-of-distribution data. Current debiasing methods often achieve good out-of-distribution generalization at the expense of significant in-distribution degradation, while non-debiasing methods sacrifice a large amount of out-of-distribution performance to achieve high in-distribution performance. We propose a novel method combining multi-teacher knowledge distillation and contrastive learning (MKDCL) to address the language prior problem in VQA. A Question Answer Selection (QAS) module selects reasonable questions for images and determines pseudo answers from the teachers' weighted predictions. A Contrastive Learning Samples Generation (CLSG) module synthesizes four types of positive and negative samples in the visual and language modalities for contrastive learning, strengthening the model's reliance on image semantics while avoiding performance degradation caused by spurious question-answer correlations. Our method is model-agnostic and achieves state-of-the-art accuracy (62.93%) on the language-prior-sensitive VQA-CP v2 dataset while maintaining accuracy (65.43%) on VQA v2.
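The abstract names two mechanisms without implementation detail: fusing several teachers' predictions into pseudo answers (QAS) and training on synthesized positive/negative pairs (CLSG). The sketch below is a minimal PyTorch illustration of generic versions of those two building blocks; the function names, tensor shapes, and the InfoNCE formulation are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def weighted_pseudo_answers(teacher_logits, teacher_weights):
    """Fuse per-teacher answer distributions into pseudo answers.

    teacher_logits:  list of (batch, num_answers) tensors, one per teacher.
    teacher_weights: (num_teachers,) tensor of non-negative fusion weights.
    Returns the argmax answer index under the weighted mixture.
    """
    probs = torch.stack([F.softmax(l, dim=-1) for l in teacher_logits])  # (T, B, A)
    w = teacher_weights / teacher_weights.sum()
    mixture = (w.view(-1, 1, 1) * probs).sum(dim=0)                      # (B, A)
    return mixture.argmax(dim=-1)                                        # (B,)

def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss: pull each anchor toward its positive, away from its negatives.

    anchor, positive: (batch, dim) embeddings; negatives: (batch, num_neg, dim).
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_sim = (anchor * positive).sum(dim=-1, keepdim=True)              # (B, 1)
    neg_sim = torch.einsum('bd,bnd->bn', anchor, negatives)              # (B, N)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```

In this framing, QAS would supply the pseudo answer labels for augmented image-question pairs, while CLSG would supply the anchor, positive, and negative embeddings fed to the contrastive loss.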
Pages: 11