Jointly Learning Attentions with Semantic Cross-Modal Correlation for Visual Question Answering

Cited by: 2

Authors:

Cao, Liangfu [1]

Gao, Lianli [1]

Song, Jingkuan [1]

Xu, Xing [1]

Shen, Heng Tao [1]

Affiliations:

[1] University of Electronic Science and Technology of China, Chengdu, Sichuan, People's Republic of China

Source:

DATABASES THEORY AND APPLICATIONS, ADC 2017 | 2017 / Vol. 10538

Funding:

National Natural Science Foundation of China;

Keywords:

DOI:

10.1007/978-3-319-68155-9_19

Chinese Library Classification number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Visual Question Answering (VQA) has emerged as a prominent multidisciplinary research problem in artificial intelligence. A number of recent studies propose attention mechanisms such as visual attention ("where to look") or question attention ("what words to listen to"), and these have proved effective for VQA. However, these studies focus on modeling the prediction error and ignore the semantic correlation between image attention and question attention, which inevitably results in suboptimal attentions. In this paper, we argue that in addition to modeling visual and question attentions, it is equally important to model their semantic correlation, both to learn the attentions jointly and to facilitate their joint representation learning for VQA. We propose a novel end-to-end model that jointly learns attentions with semantic cross-modal correlation to solve the VQA problem efficiently. Specifically, we propose a multi-modal embedding that maps the visual and question attentions into a joint space to guarantee their semantic consistency. Experimental results on the benchmark datasets demonstrate that our model outperforms several state-of-the-art techniques for VQA.

Pages: 248-260

Page count: 13
