Multi-Tier Attention Network using Term-weighted Question Features for Visual Question Answering

Cited by: 4
Authors
Manmadhan, Sruthy [1 ]
Kovoor, Binsu C. [1 ]
Affiliations
[1] Cochin Univ Sci & Technol, Div Informat Technol, Kochi 682022, Kerala, India
Keywords
Attention mechanism; Deep learning; Semantic similarity; Supervised term weighting; Visual Question Answering;
DOI
10.1016/j.imavis.2021.104291
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual Question Answering (VQA) is a challenging multi-modal task that takes an image and a natural language question about that image as inputs and must find the correct answer. This AI-complete task necessitates fine-grained joint understanding of the two input modalities. Inspired by the success of attention mechanisms in efficiently comprehending vision-language features for VQA, this paper proposes a Multi-Tier Attention Network (MTAN) whose major component is term-weighted question-guided visual attention. Additionally, we introduce a novel Supervised Term Weighting (STW) scheme named 'qf.obj.cos' that semantically weights words using the notion of visual object detection; it generalizes to other vision-language comprehension tasks such as image captioning, text-to-image retrieval, and multi-modal summarization. In effect, the proposed system generates more discriminative visual features through progressive steps of question-guided visual attention in which the question embedding is itself guided by semantic term weighting. MTAN is evaluated quantitatively and qualitatively on the benchmark DAQUAR dataset, and an extensive set of ablations demonstrates the individual significance of each component of the system. Experimental results confirm that MTAN outperforms previous works on the same dataset. (c) 2021 Elsevier B.V. All rights reserved.
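The core idea described in the abstract, word embeddings scaled by supervised term weights and then used to guide attention over image regions, can be sketched as follows. This is a minimal illustrative NumPy sketch, not the paper's implementation: the function name, dot-product scoring, and sum-pooling are assumptions, and the actual multi-tier structure and the 'qf.obj.cos' weighting scheme are not reproduced here.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def term_weighted_question_attention(word_embs, term_weights, region_feats):
    """One hypothetical tier of term-weighted, question-guided visual attention.

    word_embs:    (T, d) word embeddings of the question
    term_weights: (T,)   supervised term weights (e.g. from a qf.obj.cos-style scheme)
    region_feats: (R, d) visual features for R image regions
    Returns an attended visual feature of shape (d,).
    """
    # Scale each word embedding by its term weight, then sum-pool
    # into a single question vector (pooling choice is an assumption).
    q = (term_weights[:, None] * word_embs).sum(axis=0)   # (d,)
    # Question-guided attention over image regions via dot-product scoring.
    alpha = softmax(region_feats @ q)                     # (R,)
    # Attended visual feature: attention-weighted sum of region features.
    return alpha @ region_feats                           # (d,)
```

A multi-tier network would repeat such a step, feeding the attended feature forward so that later tiers progressively sharpen the visual representation.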
Pages: 11
Cited References
49 in total
[1]  
[Anonymous], 2017, arXiv:1702.06700
[2]  
[Anonymous], 2017, arXiv:1704.03162
[3]   VQA: Visual Question Answering [J].
Antol, Stanislaw ;
Agrawal, Aishwarya ;
Lu, Jiasen ;
Mitchell, Margaret ;
Batra, Dhruv ;
Zitnick, C. Lawrence ;
Parikh, Devi .
2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, :2425-2433
[4]  
Bahdanau D, 2016, arXiv:1409.0473, DOI 10.48550/arXiv.1409.0473
[5]  
Chen K., 2015, ABC-CNN: An attention based convolutional neural network for visual question answering, DOI 10.1155/2015/956757
[6]  
Cho K., 2014, Proc. Conf. Empirical Methods in Natural Language Processing (EMNLP), P1724
[7]  
Debole F, 2004, STUD FUZZ SOFT COMP, V138, P81
[8]   Approximate statistical tests for comparing supervised classification learning algorithms [J].
Dietterich, TG .
NEURAL COMPUTATION, 1998, 10 (07) :1895-1923
[9]   Question-Led object attention for visual question answering [J].
Gao, Lianli ;
Cao, Liangfu ;
Xu, Xing ;
Shao, Jie ;
Song, Jingkuan .
NEUROCOMPUTING, 2020, 391 :227-233
[10]   Visual Turing test for computer vision systems [J].
Geman, Donald ;
Geman, Stuart ;
Hallonquist, Neil ;
Younes, Laurent .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2015, 112 (12) :3618-3623