Transformer Gate Attention Model: An Improved Attention Model for Visual Question Answering

Cited by: 1
Authors
Zhang, Haotian [1 ]
Wu, Wei [1 ]
Affiliations
[1] Inner Mongolia Univ, Comp Sci Dept, Hohhot, Peoples R China
Source
2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN) | 2022
Funding
National Natural Science Foundation of China;
Keywords
Visual question answering; transformer; multi-modal task; cross-modal gate fusion;
DOI
10.1109/IJCNN55064.2022.9891887
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Given an image and an open-ended question about that image, a Visual Question Answering (VQA) model aims to produce the correct answer. This is a challenging task that requires a fine-grained, simultaneous understanding of the visual content of the image and the textual content of the question. However, most current models ignore region-to-word interactions and the noise introduced by irrelevant words. This paper proposes a novel model, the Transformer Gate Attention Model (TGAM), which captures inter-modal information dependence to address these problems. Specifically, TGAM is composed of an Adaptive Gate (AG) and a Parallel Transformer Module (PTM): AG fuses information between the two modalities and reduces noise, while PTM produces higher-level cross-modal feature representations. Extensive qualitative and quantitative experiments were conducted on the VQA-v2 dataset to verify the effectiveness of TGAM, and ablation studies were performed to explore the reasons behind its effectiveness. Experimental results show that TGAM significantly outperforms previous state-of-the-art methods. Our best model achieves 71.28% overall accuracy on the test-dev set and 71.6% on the test-std set.
Pages: 7
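The abstract's description of AG (per-token gating that suppresses irrelevant words before fusion) and PTM (parallel transformer encoding of the gated visual and textual streams) suggests the rough structure sketched below. This is a minimal, hypothetical PyTorch illustration of that idea, not the authors' implementation: the module names, layer counts, feature dimension (512), region/word counts, and the 3129-way answer classifier (the commonly used VQA-v2 candidate-answer vocabulary size) are all assumptions made for illustration.

# Illustrative sketch only: gated cross-modal fusion followed by parallel
# transformer encoding, loosely inspired by the AG + PTM description in the
# abstract. All names, dimensions, and wiring are assumptions, not the
# authors' released code.
import torch
import torch.nn as nn


class AdaptiveGateFusion(nn.Module):
    """Gate one modality's tokens with a signal computed from the other,
    down-weighting tokens (e.g., irrelevant question words) before fusion."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x:       (batch, n_tokens, dim)  features of one modality
        # context: (batch, dim)            pooled summary of the other modality
        ctx = context.unsqueeze(1).expand_as(x)       # broadcast over tokens
        g = self.gate(torch.cat([x, ctx], dim=-1))    # per-token gate in [0, 1]
        return g * x                                  # noisy tokens are suppressed


class ParallelTransformerFusion(nn.Module):
    """Encode each gated modality with its own transformer in parallel,
    then fuse the pooled outputs for answer classification."""

    def __init__(self, dim: int = 512, num_heads: int = 8, num_answers: int = 3129):
        super().__init__()
        self.img_gate = AdaptiveGateFusion(dim)
        self.txt_gate = AdaptiveGateFusion(dim)
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.img_encoder = nn.TransformerEncoder(make_layer(), num_layers=2)
        self.txt_encoder = nn.TransformerEncoder(make_layer(), num_layers=2)
        self.classifier = nn.Linear(2 * dim, num_answers)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (batch, n_regions, dim), txt_feats: (batch, n_words, dim)
        img_summary = img_feats.mean(dim=1)
        txt_summary = txt_feats.mean(dim=1)
        img = self.img_encoder(self.img_gate(img_feats, txt_summary))
        txt = self.txt_encoder(self.txt_gate(txt_feats, img_summary))
        fused = torch.cat([img.mean(dim=1), txt.mean(dim=1)], dim=-1)
        return self.classifier(fused)                 # answer logits


if __name__ == "__main__":
    model = ParallelTransformerFusion()
    # 36 region features and 14 word features per sample are typical choices
    # in VQA pipelines; here they are arbitrary.
    logits = model(torch.randn(2, 36, 512), torch.randn(2, 14, 512))
    print(logits.shape)  # torch.Size([2, 3129])

In this sketch the gate for each modality is conditioned on a mean-pooled summary of the other modality; the paper's actual AG and PTM designs may differ in how the gating signal and the cross-modal interaction are computed.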