Visual question answering with attention transfer and a cross-modal gating mechanism

Cited by: 18
Authors
Li, Wei [1 ]
Sun, Jianhui [1 ]
Liu, Ge [1 ]
Zhao, Linglan [1 ]
Fang, Xiangzhong [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Elect Engn, Shanghai 200240, Peoples R China
Keywords
Attention; Visual question answering; Gating;
DOI
10.1016/j.patrec.2020.02.031
CLC classification
TP18 [Artificial intelligence theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual question answering (VQA) is challenging because it requires understanding both language information and the corresponding visual content. Much effort has been devoted to capturing single-step interactions between language and vision. However, answering complex questions requires multiple steps of reasoning that gradually shift the region of interest to the most relevant part of the given image, and this has not been well investigated. To integrate question-related object relations into the attention mechanism, we propose a multi-step attention architecture that facilitates the modeling of multi-modal correlations. First, an attention transfer mechanism gradually adjusts the region of interest according to a reasoning representation of the question. Second, we propose a cross-modal gating strategy that filters out irrelevant information based on multi-modal correlations. Finally, we achieve state-of-the-art performance on the VQA 1.0 dataset and favorable results on the VQA 2.0 dataset, which verifies the effectiveness of the proposed method. (C) 2020 Elsevier B.V. All rights reserved.
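The multi-step attention with cross-modal gating described in the abstract can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all names (`attention_step`, `cross_modal_gate`, the weight matrices `Wq`, `Wv`, `Wg`, `Wt`) and the specific update rules are hypothetical, chosen only to show the pattern of attending over image regions, gating the attended feature by its correlation with the question, and transferring the attention state across reasoning steps.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_step(q, V, Wq, Wv):
    """One attention step: score every image region against the current
    question state and return the attended visual vector."""
    scores = (V @ Wv) @ (Wq @ q)          # (num_regions,)
    alpha = softmax(scores)               # attention weights over regions
    return alpha @ V, alpha               # weighted sum of region features

def cross_modal_gate(q, v, Wg):
    """Gate the attended visual feature by its correlation with the
    question state, suppressing irrelevant visual information."""
    g = sigmoid(Wg @ np.concatenate([q, v]))  # element-wise gate in (0, 1)
    return g * v

# Toy dimensions and random (untrained) parameters, for illustration only.
rng = np.random.default_rng(0)
d, num_regions, steps = 8, 5, 2
V = rng.normal(size=(num_regions, d))   # image region features
q = rng.normal(size=d)                  # question representation
Wq = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))
Wg = rng.normal(size=(d, 2 * d))
Wt = rng.normal(size=(d, d))            # attention-transfer update

# Multi-step attention: each step refines the question state with the
# gated visual evidence, shifting attention toward relevant regions.
for _ in range(steps):
    v_att, alpha = attention_step(q, V, Wq, Wv)
    v_gated = cross_modal_gate(q, v_att, Wg)
    q = np.tanh(Wt @ (q + v_gated))     # transfer the attention state

print(alpha.shape, q.shape)
```

In a trained model the weight matrices would be learned end-to-end and the final state `q` fed to an answer classifier; here they are random, so the loop only demonstrates the data flow.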
Pages: 334-340
Page count: 7
Related papers
50 records in total
  • [21] Visual question answering method based on relational reasoning and gating mechanism
    Wang X.
    Chen Q.-H.
    Sun Q.
    Jia Y.-B.
    Zhejiang Daxue Xuebao (Gongxue Ban)/Journal of Zhejiang University (Engineering Science), 2022, 56 (01): : 36 - 46
  • [22] Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering
    Zhu, Zihao
    Yu, Jing
    Wang, Yujing
    Sun, Yajing
    Hu, Yue
    Wu, Qi
    PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 1097 - 1103
  • [23] Gated Multi-modal Fusion with Cross-modal Contrastive Learning for Video Question Answering
    Lyu, Chenyang
    Li, Wenxi
    Ji, Tianbo
    Zhou, Liting
    Gurrin, Cathal
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT VII, 2023, 14260 : 427 - 438
  • [24] Enhancing Visual Question Answering with Prompt-based Learning: A Cross-modal Approach for Deep Semantic Understanding
    Zhu, Shuaiyu
    Peng, Shuo
    Chen, Shengbo
    PROCEEDINGS OF INTERNATIONAL CONFERENCE ON ALGORITHMS, SOFTWARE ENGINEERING, AND NETWORK SECURITY, ASENS 2024, 2024, : 713 - 717
  • [25] Multi-Modal Alignment of Visual Question Answering Based on Multi-Hop Attention Mechanism
    Xia, Qihao
    Yu, Chao
    Hou, Yinong
    Peng, Pingping
    Zheng, Zhengqi
    Chen, Wen
    ELECTRONICS, 2022, 11 (11)
  • [26] CroMIC-QA: The Cross-Modal Information Complementation Based Question Answering
    Qian, Shun
    Liu, Bingquan
    Sun, Chengjie
    Xu, Zhen
    Ma, Lin
    Wang, Baoxun
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 8348 - 8359
  • [27] VCD: Visual Causality Discovery for Cross-Modal Question Reasoning
    Liu, Yang
    Tan, Ying
    Luo, Jingzhou
    Chen, Weixing
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT VII, 2024, 14431 : 309 - 322
  • [28] Cross-modal generality of the gating deficit
    Edgar, JC
    Miller, GA
    Moses, SN
    Thoma, RJ
    Huang, MX
    Hanlon, FM
    Weisend, MP
    Sherwood, A
    Bustillo, J
    Adler, LE
    Cañive, JM
    PSYCHOPHYSIOLOGY, 2005, 42 (03) : 318 - 327
  • [30] Cross-modal transfer in visual and haptic object categorization
    Gaissert, N.
    Waterkamp, S.
    Van Dam, L.
    Buelthoff, I.
    PERCEPTION, 2011, 40 : 134 - 134