See and Learn More: Dense Caption-Aware Representation for Visual Question Answering

Cited by: 9
Authors
Bi, Yandong [1]
Jiang, Huajie [1]
Hu, Yongli [1]
Sun, Yanfeng [1]
Yin, Baocai [1]
Affiliations
[1] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
Keywords
Visualization; Cognition; Question answering (information retrieval); Feature extraction; Semantics; Data mining; Detectors; Visual question answering; language prior; dense caption; cross-modal fusion
DOI
10.1109/TCSVT.2023.3291379
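Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification codes
0808; 0809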
Abstract
With the rapid development of deep learning models, the Visual Question Answering (VQA) field has seen great improvements. However, modern VQA models, even the best pre-trained ones, are easily affected by language priors: they ignore image information and learn superficial relationships between questions and answers. The main reason is that visual information is not fully extracted and utilized, which results, to a certain extent, in a domain gap between the vision and language modalities. To mitigate this problem, we propose extracting dense captions (auxiliary semantic information) from images to enhance the visual information available for reasoning, and using them to narrow the gap between vision and language, since the dense captions and the questions belong to the same language modality (i.e., phrases or sentences). In this paper, we propose a novel dense caption-aware visual question answering model, called DenseCapBert, to enhance visual reasoning. Specifically, we generate dense captions for the images and propose a multimodal interaction mechanism that fuses dense captions, images, and questions in a unified framework, which makes VQA models more robust. Experimental results on the GQA, GQA-OOD, VQA v2, and VQA-CP v2 datasets show that dense captions help improve model generalization and that our model effectively mitigates the language bias problem.
Pages: 1135-1146
Number of pages: 12