Visual question answering with attention transfer and a cross-modal gating mechanism

Cited by: 18
Authors
Li, Wei [1]
Sun, Jianhui [1]
Liu, Ge [1]
Zhao, Linglan [1]
Fang, Xiangzhong [1]
Affiliations
[1] Shanghai Jiao Tong University, Department of Electronic Engineering, Shanghai 200240, People's Republic of China
Keywords
Attention; Visual question answering; Gating
DOI
10.1016/j.patrec.2020.02.031
CLC classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Visual question answering (VQA) is challenging because it requires understanding both language information and the corresponding visual content. Considerable effort has been devoted to capturing single-step interactions between language and vision. However, answering complex questions requires multiple reasoning steps that gradually shift the region of interest toward the most relevant part of the given image, and this has not been well investigated. To integrate question-related object relations into the attention mechanism, we propose a multi-step attention architecture that facilitates the modeling of multi-modal correlations. First, an attention transfer mechanism gradually adjusts the region of interest according to a reasoning representation of the question. Second, we propose a cross-modal gating strategy that filters out irrelevant information based on multi-modal correlations. Finally, we achieve state-of-the-art performance on the VQA 1.0 dataset and favorable results on the VQA 2.0 dataset, which verifies the effectiveness of the proposed method. (C) 2020 Elsevier B.V. All rights reserved.
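The abstract describes two components: a multi-step attention transfer mechanism that re-attends over image regions at each reasoning step, and a cross-modal gate that suppresses irrelevant channels based on the interaction of the two modalities. The sketch below is a minimal PyTorch illustration of how such a pipeline could be wired; the module names, dimensions, gating form, and number of reasoning steps are assumptions for illustration, not the authors' published architecture.

```python
# Minimal sketch of multi-step attention transfer + cross-modal gating.
# All design choices here (feature dim 512, 2 steps, sigmoid channel gate,
# element-wise fusion) are illustrative assumptions, not the paper's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionTransferStep(nn.Module):
    """One reasoning step: re-attend over image regions, conditioned on the
    question embedding and the attended visual feature from the previous step."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim * 3, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, regions, question, prev_ctx):
        # regions: (B, R, D) region features; question, prev_ctx: (B, D)
        B, R, D = regions.shape
        q = question.unsqueeze(1).expand(-1, R, -1)
        c = prev_ctx.unsqueeze(1).expand(-1, R, -1)
        h = torch.tanh(self.proj(torch.cat([regions, q, c], dim=-1)))
        attn = F.softmax(self.score(h).squeeze(-1), dim=-1)     # (B, R)
        ctx = torch.bmm(attn.unsqueeze(1), regions).squeeze(1)  # (B, D)
        return ctx, attn

class CrossModalGate(nn.Module):
    """Gate both modalities with a channel-wise gate computed from their
    element-wise (multiplicative) correlation."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, visual, question):
        joint = visual * question             # cross-modal correlation
        g = torch.sigmoid(self.gate(joint))   # (B, D) gate in [0, 1]
        return g * visual, g * question       # suppress irrelevant channels

class MultiStepVQA(nn.Module):
    def __init__(self, dim=512, steps=2, num_answers=3000):
        super().__init__()
        self.steps = nn.ModuleList([AttentionTransferStep(dim) for _ in range(steps)])
        self.gate = CrossModalGate(dim)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, regions, question):
        ctx = regions.mean(dim=1)  # initial context: mean-pooled regions
        for step in self.steps:    # attention transferred across steps
            ctx, _ = step(regions, question, ctx)
        v, q = self.gate(ctx, question)
        return self.classifier(v * q)

if __name__ == "__main__":
    model = MultiStepVQA()
    regions = torch.randn(4, 36, 512)      # e.g. 36 detected region features per image
    question = torch.randn(4, 512)         # encoded question (e.g. final GRU state)
    print(model(regions, question).shape)  # torch.Size([4, 3000])
```

Feeding the previous step's attended feature back into the next attention step is what lets the region of interest shift over successive steps, while the gate filters both modalities before the final fusion.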
Pages: 334-340
Page count: 7
Related papers
50 records in total
  • [1] Medical visual question answering with symmetric interaction attention and cross-modal gating
    Chen, Zhi
    Zou, Beiji
    Dai, Yulan
    Zhu, Chengzhang
    Kong, Guilan
    Zhang, Wensheng
    BIOMEDICAL SIGNAL PROCESSING AND CONTROL, 2023, 85
  • [2] Cross-Modal Multistep Fusion Network With Co-Attention for Visual Question Answering
    Lao, Mingrui
    Guo, Yanming
    Wang, Hui
    Zhang, Xin
    IEEE ACCESS, 2018, 6 : 31516 - 31524
  • [3] Human Guided Cross-Modal Reasoning with Semantic Attention Learning for Visual Question Answering
    Liao, Lei
    Feng, Mao
    Yang, Meng
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 2775 - 2779
  • [4] Cross-Modal Visual Question Answering for Remote Sensing Data
    Felix, Rafael
    Repasky, Boris
    Hodge, Samuel
    Zolfaghari, Reza
    Abbasnejad, Ehsan
    Sherrah, Jamie
    2021 INTERNATIONAL CONFERENCE ON DIGITAL IMAGE COMPUTING: TECHNIQUES AND APPLICATIONS (DICTA 2021), 2021, : 57 - 65
  • [5] Cross-modal Relational Reasoning Network for Visual Question Answering
    Chen, Hongyu
    Liu, Ruifang
    Peng, Bo
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 3939 - 3948
  • [6] Asymmetric cross-modal attention network with multimodal augmented mixup for medical visual question answering
    Li, Yong
    Yang, Qihao
    Wang, Fu Lee
    Lee, Lap-Kei
    Qu, Yingying
    Hao, Tianyong
    ARTIFICIAL INTELLIGENCE IN MEDICINE, 2023, 144
  • [7] Cross-Modal Retrieval for Knowledge-Based Visual Question Answering
    Lerner, Paul
    Ferret, Olivier
    Guinaudeau, Camille
    ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT I, 2024, 14608 : 421 - 438
  • [8] Reasoning on the Relation: Enhancing Visual Representation for Visual Question Answering and Cross-Modal Retrieval
    Yu, Jing
    Zhang, Weifeng
    Lu, Yuhang
    Qin, Zengchang
    Hu, Yue
    Tan, Jianlong
    Wu, Qi
    IEEE TRANSACTIONS ON MULTIMEDIA, 2020, 22 (12) : 3196 - 3209
  • [9] Cross-modal knowledge reasoning for knowledge-based visual question answering
    Yu, Jing
    Zhu, Zihao
    Wang, Yujing
    Zhang, Weifeng
    Hu, Yue
    Tan, Jianlong
    PATTERN RECOGNITION, 2020, 108
  • [10] Cross-Modal Dense Passage Retrieval for Outside Knowledge Visual Question Answering
    Reichman, Benjamin
    Heck, Larry
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 2829 - 2834