Visual Question Answering With a Hybrid Convolution Recurrent Model

Cited: 3
Authors
Harzig, Philipp [1]
Eggert, Christian [1]
Lienhart, Rainer [1]
Affiliation
[1] Univ Augsburg, Augsburg, Germany
Source
ICMR '18: Proceedings of the 2018 ACM International Conference on Multimedia Retrieval | 2018
Keywords
VQA; Visual Question Answering; multimodal retrieval; natural language generation; LSTM; multimodal fusion
DOI
10.1145/3206025.3206054
CLC Classification
TP [Automation Technology; Computer Technology]
Discipline Code
0812
Abstract
Visual Question Answering (VQA) is a relatively new task that aims to infer an answer sentence for an input image coupled with a corresponding question. Instead of generating answers dynamically, answers are usually inferred by selecting the most probable one from a fixed set of possible answers; previous work thus modeled the answering part of VQA purely as a classification task and did not address the problem of producing arbitrary answers. To tackle this problem, we infer answer sentences with a Long Short-Term Memory (LSTM) network, which allows us to dynamically generate answers for (image, question) pairs. In a series of experiments, we arrive at an end-to-end Deep Neural Network structure that dynamically answers questions about a given input image using an LSTM decoder network. With this approach, we are able to generate both less common answers, which classification models do not consider, and, as datasets containing answers of more than three words appear, more complex answers.
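To make the general approach the abstract describes more concrete, below is a minimal PyTorch sketch of a generative VQA model: a question is encoded with an embedding layer and an LSTM, fused with a CNN image feature, and the fused representation initializes an LSTM decoder that produces the answer word by word. This is not the authors' implementation; the class name VQAAnswerGenerator, the elementwise-product fusion, the layer sizes, and the assumption of precomputed image features (e.g. from a ResNet) are all illustrative choices.

```python
# Illustrative sketch of a generative VQA model (not the authors' implementation).
# Assumptions: precomputed CNN image features, elementwise-product fusion,
# teacher-forced training of the answer decoder.
import torch
import torch.nn as nn

class VQAAnswerGenerator(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, img_feat_dim=2048):
        super().__init__()
        # Project the precomputed CNN image feature into the joint space.
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        # Question encoder: word embeddings followed by an LSTM.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.q_encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Answer decoder: an LSTM whose initial hidden state is the fused
        # (image, question) representation; it emits one word per step.
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, question_ids, answer_ids):
        # Encode the question and keep the final hidden state.
        q_emb = self.embed(question_ids)                   # (B, Tq, E)
        _, (q_hidden, _) = self.q_encoder(q_emb)           # (1, B, H)
        # Fuse image and question by elementwise product (one common choice).
        img = torch.tanh(self.img_proj(img_feat))          # (B, H)
        fused = (q_hidden.squeeze(0) * img).unsqueeze(0)   # (1, B, H)
        # Teacher-forced decoding of the answer sequence.
        a_emb = self.embed(answer_ids)                      # (B, Ta, E)
        dec_out, _ = self.decoder(a_emb, (fused, torch.zeros_like(fused)))
        return self.out(dec_out)                            # (B, Ta, vocab_size)
```

At inference time the answer would be generated token by token, e.g. greedily, starting from a start-of-sentence token and feeding each predicted word back into the decoder until an end-of-sentence token is produced.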
Pages: 318 - 325
Number of pages: 8