Robust data augmentation and contrast learning for debiased visual question answering

Cited by: 0
Authors
Ning, Ke [1 ,2 ]
Li, Zhixin [1 ,2 ]
Affiliations
[1] Guangxi Normal Univ, Key Lab Educ Blockchain & Intelligent Technol, Minist Educ, Guilin 541004, Peoples R China
[2] Guangxi Normal Univ, Guangxi Key Lab Multisource Informat Min & Secur, Guilin 541004, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visual question answering; Language priors; Data augmentation; Knowledge distillation; Contrastive learning;
DOI
10.1016/j.neucom.2025.129527
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The language prior problem in visual question answering (VQA) leads models to predict answers directly from spurious correlations between questions and answers, so their performance drops sharply out of distribution. Current debiasing methods often achieve good out-of-distribution generalization at the cost of significant in-distribution degradation, while non-debiasing methods sacrifice a large amount of out-of-distribution performance to achieve high in-distribution accuracy. We propose a novel method combining multi-teacher knowledge distillation and contrastive learning (MKDCL) to address the language prior problem in VQA. We propose a Question Answer Selection (QAS) module that selects reasonable questions for images and determines pseudo answers from the teachers' weighted predictions. We also propose a Contrastive Learning Samples Generation (CLSG) module that synthesizes four types of positive and negative samples in the visual and language modalities for contrastive learning, effectively increasing the model's semantic dependency on the images while avoiding performance degradation caused by spurious question-answer correlations. Our method is model-agnostic and achieves state-of-the-art performance (62.93%) on the language-prior-sensitive VQA-CP v2 dataset while maintaining performance (65.43%) on the VQA v2 dataset.
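The abstract describes two ingredients: pseudo answers obtained from a weighted combination of multiple teachers' predictions, and a contrastive objective over synthesized positive/negative samples. The paper does not specify the exact weighting scheme or loss in the abstract; the sketch below is only illustrative, using a simple weighted average of teacher answer distributions for the pseudo label and a standard InfoNCE-style contrastive loss. All function names and weights here are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def weighted_pseudo_answer(teacher_logits, teacher_weights):
    """Hypothetical sketch: fuse several teachers' answer logits into one
    pseudo answer by averaging their softmax distributions with given
    (non-negative) weights, then taking the argmax as the pseudo label."""
    probs = np.stack([softmax(l) for l in teacher_logits])  # (T, num_answers)
    w = np.asarray(teacher_weights, dtype=float)
    w = w / w.sum()                                          # normalize weights
    mixed = (w[:, None] * probs).sum(axis=0)                 # weighted mixture
    return int(mixed.argmax())                               # pseudo answer index

def info_nce(anchor, positive, negatives, tau=0.07):
    """Standard InfoNCE contrastive loss for one anchor embedding:
    pull the positive sample close, push the negatives away."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                                   # stability shift
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```

For example, a confident teacher favoring answer 0 with weight 0.9 dominates a weaker teacher favoring answer 1, so the pseudo answer is 0; the loss is small when the anchor is much closer to its positive than to the negatives.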
Pages: 11