Deep Attention Neural Tensor Network for Visual Question Answering

Cited by: 56
Authors
Bai, Yalong [1, 2]
Fu, Jianlong [3 ]
Zhao, Tiejun [1 ]
Mei, Tao [2 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Peoples R China
[2] JD AI Res, Beijing, Peoples R China
[3] Microsoft Res Asia, Beijing, Peoples R China
Source
COMPUTER VISION - ECCV 2018, PT XII | 2018 / Vol. 11216
Keywords
Visual question answering; Neural tensor network; Open-ended VQA
DOI
10.1007/978-3-030-01258-8_2
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Visual question answering (VQA), which requires a machine to answer a natural language question about a reference image, has drawn great attention as a cross-modal learning problem. Significant progress has been made by learning rich embedding features from images and questions with bilinear models, but these approaches neglect the key role of answers. In this paper, we propose a novel deep attention neural tensor network (DA-NTN) for visual question answering, which can discover the joint correlations over images, questions and answers with tensor-based representations. First, we model one of the pairwise interactions (e.g., image and question) by bilinear features, which are further encoded with the third dimension (e.g., answer) into a triplet by a bilinear tensor product. Second, we decompose the correlation of different triplets by answer and question types, and further propose a slice-wise attention module on the tensor to select the most discriminative reasoning process for inference. Third, we optimize the proposed DA-NTN by learning a label regression with KL-divergence losses. Such a design enables scalable training and fast convergence over a large answer set. We integrate the proposed DA-NTN structure into state-of-the-art VQA models (e.g., MLB and MUTAN). Extensive experiments demonstrate superior accuracy over the original MLB and MUTAN models, with relative increases of 1.98% and 1.70%, respectively, on the VQA-2.0 dataset.
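The three steps named in the abstract (bilinear tensor product over a triplet, slice-wise attention, KL-divergence label regression) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The dimensions, the random parameters, the soft target, and the choice to compute attention weights as a softmax over the slice responses are all assumptions made for illustration, and the fused image-question vector `qv` simply stands in for the output of a bilinear model such as MLB or MUTAN.

```python
import math
import random


def bilinear(x, W, y):
    """Return x^T W y for vectors x, y and a len(x) x len(y) matrix W."""
    return sum(x[i] * W[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def triplet_score(qv, tensor, answer):
    """Score one (image, question, answer) triplet.

    qv     : fused image-question feature (stand-in for a bilinear model output)
    tensor : list of k slices, each a len(qv) x len(answer) matrix
    answer : answer embedding
    """
    slice_scores = [bilinear(qv, W, answer) for W in tensor]
    # Assumption: attention weights are a softmax over the slice responses.
    attention = softmax(slice_scores)
    return sum(a * s for a, s in zip(attention, slice_scores))


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted) if t > 0)


def random_matrix(rng, rows, cols):
    """Hypothetical helper: a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_qv, d_ans, k_slices, n_answers = 4, 3, 2, 5  # illustrative sizes only

qv = [rng.uniform(-1, 1) for _ in range(d_qv)]
tensor = [random_matrix(rng, d_qv, d_ans) for _ in range(k_slices)]
answers = random_matrix(rng, n_answers, d_ans)  # one embedding per candidate

# Score every candidate answer, then normalise into a distribution.
scores = [triplet_score(qv, tensor, a) for a in answers]
predicted = softmax(scores)

# Hypothetical soft target: answer mass spread over two candidates.
target = [0.7, 0.3, 0.0, 0.0, 0.0]
loss = kl_divergence(target, predicted)
print("predicted distribution:", [round(p, 3) for p in predicted])
print("KL loss:", round(loss, 4))
```

In this sketch each tensor slice can be read as one candidate "reasoning process" relating the question-image feature to an answer, and the attention weights decide how much each slice contributes to the final score. A real model would learn `tensor` and the answer embeddings by gradient descent on the KL loss rather than drawing them at random.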
Pages: 21-37
Page count: 17