See and Learn More: Dense Caption-Aware Representation for Visual Question Answering

Cited by: 9
Authors
Bi, Yandong [1]
Jiang, Huajie [1]
Hu, Yongli [1]
Sun, Yanfeng [1]
Yin, Baocai [1]
Affiliations
[1] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
Keywords
Visualization; Cognition; Question answering (information retrieval); Feature extraction; Semantics; Data mining; Detectors; Visual question answering; language prior; dense caption; cross-modal fusion;
DOI
10.1109/TCSVT.2023.3291379
Chinese Library Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Code
0808 ; 0809 ;
Abstract
With the rapid development of deep learning models, great improvements have been achieved in the Visual Question Answering (VQA) field. However, modern VQA models are easily affected by language priors: they ignore image information and learn superficial relationships between questions and answers, even in strong pre-trained models. The main reason is that visual information is not fully extracted and utilized, which leaves a domain gap between the vision and language modalities. To mitigate this problem, we propose to extract dense captions (auxiliary semantic information) from images to enhance the visual information for reasoning, and to use them to bridge the gap between vision and language, since the dense captions and the questions belong to the same language modality (i.e., phrases or sentences). In this paper, we propose a novel dense caption-aware visual question answering model called DenseCapBert to enhance visual reasoning. Specifically, we generate dense captions for the images and propose a multimodal interaction mechanism that fuses dense captions, images, and questions in a unified framework, which makes the VQA model more robust. Experimental results on the GQA, GQA-OOD, VQA v2, and VQA-CP v2 datasets show that dense captions improve model generalization and that our model effectively mitigates the language bias problem.
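The abstract describes fusing three streams (question, image regions, dense captions) in one cross-modal framework. The sketch below illustrates that general idea in PyTorch; it is not the authors' DenseCapBert implementation, and every module name, dimension, pooling choice, and the answer-classifier head are assumptions made only for this illustration.

# Minimal, illustrative sketch (NOT the paper's DenseCapBert code): embed question
# tokens, detector region features, and dense-caption tokens, tag each stream with a
# type embedding, and let a shared transformer encoder perform the cross-modal fusion.
import torch
import torch.nn as nn

class DenseCaptionAwareVQA(nn.Module):
    def __init__(self, vocab_size=30522, d_model=768, n_heads=12, n_layers=6, n_answers=3129):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)   # shared by questions and captions
        self.region_proj = nn.Linear(2048, d_model)          # project detector region features
        self.type_emb = nn.Embedding(3, d_model)              # 0=question, 1=image, 2=dense caption
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # joint multimodal interaction
        self.classifier = nn.Linear(d_model, n_answers)         # answer prediction head

    def forward(self, question_ids, region_feats, caption_ids):
        # Embed each modality and add its segment/type embedding.
        q = self.token_emb(question_ids) + self.type_emb(torch.zeros_like(question_ids))
        v = self.region_proj(region_feats) + self.type_emb(
            torch.ones(region_feats.shape[:2], dtype=torch.long, device=region_feats.device))
        c = self.token_emb(caption_ids) + self.type_emb(torch.full_like(caption_ids, 2))
        # Concatenate the three streams and let self-attention fuse them.
        fused = self.encoder(torch.cat([q, v, c], dim=1))
        # Mean-pool the fused sequence and classify over the answer vocabulary.
        return self.classifier(fused.mean(dim=1))

# Toy usage with random inputs (batch of 2): 20 question tokens, 36 detected regions
# with 2048-d features, and 60 dense-caption tokens.
model = DenseCaptionAwareVQA()
logits = model(torch.randint(0, 30522, (2, 20)),
               torch.randn(2, 36, 2048),
               torch.randint(0, 30522, (2, 60)))
print(logits.shape)  # torch.Size([2, 3129])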
Pages: 1135 - 1146
Number of pages: 12
Related Papers
50 records in total
  • [21] VQA as a factoid question answering problem: A novel approach for knowledge-aware and explainable visual question answering
    Narayanan, Abhishek
    Rao, Abijna
    Prasad, Abhishek
    Natarajan, S.
    IMAGE AND VISION COMPUTING, 2021, 116
  • [22] An Effective Dense Co-Attention Networks for Visual Question Answering
    He, Shirong
    Han, Dezhi
    SENSORS, 2020, 20 (17) : 1 - 15
  • [23] CAAN: Context-Aware attention network for visual question answering
    Chen, Chongqing
    Han, Dezhi
    Chang, Chin-Chen
    PATTERN RECOGNITION, 2022, 132
  • [24] Language-aware Visual Semantic Distillation for Video Question Answering
    Zou, Bo
    Yang, Chao
    Qiao, Yu
    Quan, Chengbin
    Zhao, Youjian
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 27103 - 27113
  • [25] Boosting Visual Question Answering with Context-aware Knowledge Aggregation
    Li, Guohao
    Wang, Xin
    Zhu, Wenwu
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 1227 - 1235
  • [26] Semantic-Aware Modular Capsule Routing for Visual Question Answering
    Han, Yudong
    Yin, Jianhua
    Wu, Jianlong
    Wei, Yinwei
    Nie, Liqiang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 : 5537 - 5549
  • [27] Relation-Aware Image Captioning for Explainable Visual Question Answering
    Tseng, Ching-Shan
    Lin, Ying-Jia
    Kao, Hung-Yu
    2022 INTERNATIONAL CONFERENCE ON TECHNOLOGIES AND APPLICATIONS OF ARTIFICIAL INTELLIGENCE, TAAI, 2022, : 149 - 154
  • [29] Relation-Aware Graph Attention Network for Visual Question Answering
    Li, Linjie
    Gan, Zhe
    Cheng, Yu
    Liu, Jingjing
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 10312 - 10321
  • [30] Adversarial Learning of Answer-Related Representation for Visual Question Answering
    Liu, Yun
    Zhang, Xiaoming
    Huang, Feiran
    Li, Zhoujun
    CIKM'18: PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, 2018, : 1013 - 1022