Jointly Learning Attentions with Semantic Cross-Modal Correlation for Visual Question Answering

Cited by: 2
Authors
Cao, Liangfu [1 ]
Gao, Lianli [1 ]
Song, Jingkuan [1 ]
Xu, Xing [1 ]
Shen, Heng Tao [1 ]
Affiliations
[1] Univ Elect Sci & Technol China, Chengdu, Sichuan, Peoples R China
Funding
National Natural Science Foundation of China
DOI
10.1007/978-3-319-68155-9_19
CLC Classification
TP [Automation Technology, Computer Technology]
Subject Classification
0812
Abstract
Visual Question Answering (VQA) has emerged as a prominent multi-disciplinary research problem in artificial intelligence. A number of recent studies focus on attention mechanisms such as visual attention ("where to look") or question attention ("what words to listen to"), and these have proven effective for VQA. However, they model only the prediction error and ignore the semantic correlation between image attention and question attention, which inevitably leads to suboptimal attentions. We argue that, in addition to modeling visual and question attentions, it is equally important to model their semantic correlation, so that the two attentions are learned jointly and their joint representation learning for VQA is facilitated. We therefore propose a novel end-to-end model that jointly learns attentions with semantic cross-modal correlation to solve the VQA problem efficiently. Specifically, we propose a multi-modal embedding that maps the visual and question attentions into a joint space to guarantee their semantic consistency. Experimental results on benchmark datasets demonstrate that our model outperforms several state-of-the-art VQA techniques.
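The abstract describes the method only at a high level. The following PyTorch sketch illustrates one plausible reading of it: a question-guided visual attention, an image-guided question attention, a multi-modal embedding that maps both attended features into a joint space, and a consistency term that penalizes semantic disagreement between the two. All layer sizes, module names, and the cosine-based consistency loss are illustrative assumptions, not the authors' published architecture.

```python
# Minimal sketch (assumptions throughout): joint visual/question attention
# with a cross-modal consistency loss, in the spirit of the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttentionVQA(nn.Module):
    def __init__(self, img_dim=2048, q_dim=512, joint_dim=512, n_answers=1000):
        super().__init__()
        # Visual attention: score each image region against the question.
        self.v_att = nn.Linear(img_dim + q_dim, 1)
        # Question attention: score each word against the pooled image.
        self.q_att = nn.Linear(q_dim + img_dim, 1)
        # Multi-modal embeddings into a shared joint space.
        self.v_embed = nn.Linear(img_dim, joint_dim)
        self.q_embed = nn.Linear(q_dim, joint_dim)
        self.classifier = nn.Linear(2 * joint_dim, n_answers)

    def forward(self, img_feats, q_feats):
        # img_feats: (B, R, img_dim) region features; q_feats: (B, T, q_dim) word features.
        q_global = q_feats.mean(dim=1)    # (B, q_dim)
        v_global = img_feats.mean(dim=1)  # (B, img_dim)

        # "Where to look": attention over image regions conditioned on the question.
        v_scores = self.v_att(torch.cat(
            [img_feats, q_global.unsqueeze(1).expand(-1, img_feats.size(1), -1)], dim=-1))
        v_weights = F.softmax(v_scores, dim=1)           # (B, R, 1)
        v_attended = (v_weights * img_feats).sum(dim=1)  # (B, img_dim)

        # "What words to listen to": attention over words conditioned on the image.
        q_scores = self.q_att(torch.cat(
            [q_feats, v_global.unsqueeze(1).expand(-1, q_feats.size(1), -1)], dim=-1))
        q_weights = F.softmax(q_scores, dim=1)           # (B, T, 1)
        q_attended = (q_weights * q_feats).sum(dim=1)    # (B, q_dim)

        # Map both attended features into the joint space.
        v_joint = F.normalize(self.v_embed(v_attended), dim=-1)
        q_joint = F.normalize(self.q_embed(q_attended), dim=-1)

        # Semantic consistency: pull the two attended views together.
        corr_loss = (1.0 - F.cosine_similarity(v_joint, q_joint, dim=-1)).mean()

        logits = self.classifier(torch.cat([v_joint, q_joint], dim=-1))
        return logits, corr_loss

# Usage sketch: the total loss combines answer classification with the
# correlation term; the 0.5 weight is an arbitrary placeholder.
model = JointAttentionVQA()
img = torch.randn(8, 36, 2048)  # e.g., 36 region features per image
qst = torch.randn(8, 14, 512)   # e.g., 14 encoded question words
logits, corr_loss = model(img, qst)
answers = torch.randint(0, 1000, (8,))
loss = F.cross_entropy(logits, answers) + 0.5 * corr_loss
```

Because the whole pipeline is differentiable, both attentions and the joint embedding are trained end-to-end, which matches the abstract's claim that the attentions are learned jointly rather than fit independently to the prediction error.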
Pages: 248-260 (13 pages)
Related Papers
50 entries in total
  • [11] Cross-Modal Dense Passage Retrieval for Outside Knowledge Visual Question Answering
    Reichman, Benjamin
    Heck, Larry
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 2829 - 2834
  • [12] Structured Attentions for Visual Question Answering
    Zhu, Chen
    Zhao, Yanpeng
    Huang, Shuaiyi
    Tu, Kewei
    Ma, Yi
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 1300 - 1309
  • [13] ALSA: Adversarial Learning of Supervised Attentions for Visual Question Answering
    Liu, Yun
    Zhang, Xiaoming
    Zhao, Zhiyun
    Zhang, Bo
    Cheng, Lei
    Li, Zhoujun
    IEEE TRANSACTIONS ON CYBERNETICS, 2022, 52 (06) : 4520 - 4533
  • [14] Cross-Modal Correlation Learning by Adaptive Hierarchical Semantic Aggregation
    Hua, Yan
    Wang, Shuhui
    Liu, Siyuan
    Cai, Anni
    Huang, Qingming
    IEEE TRANSACTIONS ON MULTIMEDIA, 2016, 18 (06) : 1201 - 1216
  • [15] Gated Multi-modal Fusion with Cross-modal Contrastive Learning for Video Question Answering
    Lyu, Chenyang
    Li, Wenxi
    Ji, Tianbo
    Zhou, Liting
    Gurrin, Cathal
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT VII, 2023, 14260 : 427 - 438
  • [16] Deep Semantic Correlation with Adversarial Learning for Cross-Modal Retrieval
    Hua, Yan
    Du, Jianhe
    PROCEEDINGS OF 2019 IEEE 9TH INTERNATIONAL CONFERENCE ON ELECTRONICS INFORMATION AND EMERGENCY COMMUNICATION (ICEIEC 2019), 2019, : 252 - 255
  • [17] Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering
    Liu, Yang
    Li, Guanbin
    Lin, Liang
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (10) : 11624 - 11641
  • [18] Cross-Modal Multistep Fusion Network With Co-Attention for Visual Question Answering
    Lao, Mingrui
    Guo, Yanming
    Wang, Hui
    Zhang, Xin
    IEEE ACCESS, 2018, 6 : 31516 - 31524
  • [19] Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering
    Zhang, Jing
    Liu, Xiaoqiang
    Chen, Mingzhe
    Wang, Zhe
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 7, 2024, : 7151 - 7159
  • [20] Lightweight recurrent cross-modal encoder for video question answering
    Immanuel, Steve Andreas
    Jeong, Cheol
    KNOWLEDGE-BASED SYSTEMS, 2023, 276