Ask Your Neurons: A Deep Learning Approach to Visual Question Answering

Authors
Mateusz Malinowski
Marcus Rohrbach
Mario Fritz
Affiliations
[1] Max Planck Institute for Informatics
[2] Saarland Informatics Campus
[3] UC Berkeley EECS
Source
International Journal of Computer Vision | 2017 / Vol. 125
Keywords
Computer vision; Scene understanding; Deep learning; Natural language processing; Visual Turing test; Visual question answering
Abstract
We propose a Deep Learning approach to the visual question answering task, where machines answer questions about real-world images. By combining the latest advances in image representation and natural language processing, we propose Ask Your Neurons, a scalable, jointly trained, end-to-end formulation of this problem. In contrast to previous efforts, we are facing a multi-modal problem where the language output (answer) is conditioned on visual and natural language inputs (image and question). We evaluate our approaches on the DAQUAR and VQA datasets, where we also report various baselines, including an analysis of how much information is contained in the language part alone. To study human consensus, we propose two novel metrics and collect additional answers, which extend the original DAQUAR dataset to DAQUAR-Consensus. Finally, we evaluate a rich set of design choices for how to encode, combine, and decode information in our proposed Deep Learning formulation.
Pages: 110-135
Number of pages: 25