Multi-view Attention Networks for Visual Question Answering

Cited by: 0
Authors
Li, Min [1 ]
Bai, Zongwen [1 ]
Deng, Jie [2 ]
Affiliations
[1] Yanan Univ, Sch Phys & Elect Informat, Yanan, Peoples R China
[2] Sichuan Jiuzhou Elect Grp Co Ltd, Dept 705, Mianyang, Sichuan, Peoples R China
Source
2024 6TH INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING, ICNLP 2024 | 2024
Keywords
Visual question answering; feature extraction; attention mechanism; transformer;
DOI
10.1109/ICNLP60986.2024.10692598
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual question answering (VQA) is a typical multimodal task that requires expertise from both computer vision and natural language processing. The essence of VQA lies in jointly comprehending fine-grained language and visual information. In recent years, transformer-based methods have achieved remarkable success in advancing the state of the art in VQA. In this paper, we present an enhanced model, the Multi-view Attention Network (MVAN), a variant of the Transformer architecture. MVAN improves the model's ability to filter out irrelevant information and to focus on local features. Specifically, we augment the network with a Gated Linear Unit (GLU) to discern and filter irrelevant or inconsequential information. Additionally, a Gated Convolution Block (GCB) is introduced into the self-attention layer of the Transformer variant. This integration facilitates the extraction of contextual semantic image information from both channel and spatial perspectives. As a result, the model effectively combines local and global information, improving its prediction accuracy on VQA tasks. Finally, the model is evaluated on the VQA-v2 dataset, and the results demonstrate a notable performance improvement over existing methods. Furthermore, we conduct extensive ablation experiments to explore the reasons for the effectiveness of the MVAN.
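To illustrate the two mechanisms named in the abstract, the following is a minimal PyTorch sketch of a GLU feature filter and a gated convolution branch fused with self-attention over grid image features. The class names, the fusion by addition, and all hyperparameters are assumptions made for illustration only; this is not the authors' released implementation.

# Hypothetical sketch of the gating ideas described in the abstract (not the authors' code):
# a GLU that filters features, plus a gated convolution branch alongside self-attention.
import torch
import torch.nn as nn


class GLUFilter(nn.Module):
    """Standard Gated Linear Unit: half of the projection gates the other half."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, 2 * dim)

    def forward(self, x):
        a, b = self.proj(x).chunk(2, dim=-1)
        return a * torch.sigmoid(b)  # sigmoid gate suppresses irrelevant features


class GatedConvBlock(nn.Module):
    """Sketch of a gated convolution over spatially arranged grid features,
    gating along both channel and spatial dimensions."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim, 1), nn.Sigmoid()
        )
        self.spatial_gate = nn.Sequential(nn.Conv2d(dim, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):                       # x: (B, N, D), N = H * W grid positions
        B, N, D = x.shape
        h = w = int(N ** 0.5)
        y = x.transpose(1, 2).reshape(B, D, h, w)
        y = self.conv(y)                        # local contextual features
        y = y * self.channel_gate(y) * self.spatial_gate(y)
        return y.reshape(B, D, N).transpose(1, 2)


class MultiViewSelfAttention(nn.Module):
    """Self-attention (global view) fused with the gated conv branch (local view),
    followed by a GLU filter; additive fusion is an assumption."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gcb = GatedConvBlock(dim)
        self.glu = GLUFilter(dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        global_feat, _ = self.attn(x, x, x)     # long-range dependencies
        local_feat = self.gcb(x)                # gated local/spatial context
        return self.norm(x + self.glu(global_feat + local_feat))


if __name__ == "__main__":
    feats = torch.randn(2, 49, 512)             # e.g. 7x7 grid features per image
    print(MultiViewSelfAttention()(feats).shape)  # torch.Size([2, 49, 512])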
Pages: 788-794
Page count: 7