HCCL: Hierarchical Counterfactual Contrastive Learning for Robust Visual Question Answering

Times Cited: 0
Authors
Hao, Dongze [1 ,2 ]
Wang, Qunbo [1 ]
Zhu, Xinxin [1 ]
Liu, Jing [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Lab Cognit & Decis Intelligence Complex Syst, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
关键词
Visual question answering; hierarchical counterfactual contrastive learning; robust VQA;
DOI
10.1145/3673902
Chinese Library Classification (CLC)
TP [Automation & Computer Technology];
Discipline code
0812 ;
Abstract
Although most state-of-the-art models achieve impressive performance on Visual Question Answering (VQA), they often rely on dataset biases to answer questions. Recently, some studies have synthesized counterfactual training samples to help models mitigate these biases. However, such synthetic samples require extra annotations and often contain noise. Moreover, these methods simply add the synthetic samples to the training data and train with the cross-entropy loss, which does not make the best use of them for bias mitigation. In this article, to mitigate biases in VQA more effectively, we propose a Hierarchical Counterfactual Contrastive Learning (HCCL) method. First, to avoid introducing noise and extra annotations, our method automatically masks the unimportant features in the original pairs to obtain positive samples and creates mismatched question-image pairs as negative samples. Then our method uses feature-level and answer-level contrastive learning to pull the original sample close to positive samples in the feature space, while pushing it away from negative samples in both the feature and answer spaces. In this way, the VQA model learns robust multimodal features and attends to both visual and language information when producing the answer. HCCL can be adopted with different baselines, and experimental results on the VQA v2, VQA-CP, and GQA-OOD datasets show that our method effectively mitigates biases in VQA and improves the robustness of the VQA model.
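The core idea in the abstract — pulling an original sample toward a masked-feature positive and pushing it away from mismatched-pair negatives — follows the standard InfoNCE contrastive formulation. The sketch below is an illustration of that general loss only, not the authors' actual implementation; the vector dimensions, temperature, and the way positives/negatives are produced are all assumptions for demonstration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def contrastive_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss: small when anchor is close to the positive
    and far from every negative; large otherwise."""
    pos = np.exp(cosine(anchor, positive) / tau)
    negs = sum(np.exp(cosine(anchor, n) / tau) for n in negatives)
    return -np.log(pos / (pos + negs))

rng = np.random.default_rng(0)
anchor = rng.standard_normal(64)                      # fused (image, question) feature
positive = anchor + 0.05 * rng.standard_normal(64)    # stands in for a masked-feature view
negatives = [rng.standard_normal(64) for _ in range(4)]  # stand in for mismatched pairs

loss = contrastive_loss(anchor, positive, negatives)
```

In HCCL this loss would be applied at both the feature level and the answer level; the single-level version above only shows the pull/push mechanism the abstract describes.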
Pages: 21