Visual Dialog

Cited by: 22
Authors
Das, Abhishek [1]
Kottur, Satwik [2]
Gupta, Khushi [2]
Singh, Avi [3]
Yadav, Deshraj [1]
Lee, Stefan [1]
Moura, Jose M. F. [2]
Parikh, Devi [1,4]
Batra, Dhruv [1,4]
Affiliations
[1] Georgia Tech, Atlanta, GA 30332, USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213, USA
[3] Univ Calif Berkeley, Berkeley, CA 94720, USA
[4] Facebook AI Res, Menlo Park, CA, USA
Keywords
Visual dialog; computer vision; natural language processing; machine learning; GAME; GO;
DOI
10.1109/TPAMI.2018.2828437
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in the image, infer context from the history, and answer the question accurately. Visual Dialog is disentangled enough from any specific downstream task to serve as a general test of machine intelligence, while being sufficiently grounded in vision to allow objective evaluation of individual responses and benchmark progress. We develop a novel two-person real-time chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial). VisDial v0.9 has been released and consists of ~1.2M dialog question-answer pairs from 10-round, human-human dialogs grounded in ~120k images from the COCO dataset. We introduce a family of neural encoder-decoder models for Visual Dialog with three encoders (Late Fusion, Hierarchical Recurrent Encoder, and Memory Network, optionally with attention over image features) and two decoders (generative and discriminative), which outperform a number of sophisticated baselines. We propose a retrieval-based evaluation protocol for Visual Dialog in which the AI agent is asked to sort a set of candidate answers and is evaluated on metrics such as mean reciprocal rank and recall@k of the human response. We quantify the gap between machine and human performance on the Visual Dialog task via human studies. Putting it all together, we demonstrate the first 'visual chatbot'! Our dataset, code, pretrained models, and visual chatbot are available at https://visualdialog.org.
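Since the retrieval-based evaluation is central to the benchmark, a minimal sketch may help: for each question the model ranks a pool of candidate answers (100 candidates per question in VisDial), and the metrics score the rank the model assigns to the human response. The Python below is illustrative, not code from the authors' release; the rank values are made-up numbers, not results from the paper.

def mean_reciprocal_rank(human_ranks):
    # Average of 1/rank of the human answer over all questions.
    return sum(1.0 / r for r in human_ranks) / len(human_ranks)

def recall_at_k(human_ranks, k):
    # Fraction of questions whose human answer lands in the top k candidates.
    return sum(1 for r in human_ranks if r <= k) / len(human_ranks)

# Hypothetical ranks of the human response among the candidate answers
# for six questions (rank 1 = the model placed the human answer first).
ranks = [1, 3, 2, 15, 1, 7]
print(f"MRR : {mean_reciprocal_rank(ranks):.4f}")
print(f"R@5 : {recall_at_k(ranks, 5):.4f}")
print(f"R@10: {recall_at_k(ranks, 10):.4f}")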
Pages: 1242-1256
Page count: 15
References
78 in total
[51] Microsoft COCO: Common Objects in Context [J].
Lin, Tsung-Yi; Maire, Michael; Belongie, Serge; Hays, James; Perona, Pietro; Ramanan, Deva; Dollar, Piotr; Zitnick, C. Lawrence.
Computer Vision - ECCV 2014, Pt V, 2014, 8693: 740-755
[52] Liu, Chia-Wei. Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
[53] Lu, J. Deeper LSTM and Normalized CNN Visual Question Answering Model, 2015.
[54] Lu, J. S. Advances in Neural Information Processing Systems (NIPS), Vol. 29, 2016.
[55] Malinowski, M. Advances in Neural Information Processing Systems (NIPS), Vol. 27, 2014.
[56] Ask Your Neurons: A Neural-based Approach to Answering Questions about Images [J].
Malinowski, Mateusz; Rohrbach, Marcus; Fritz, Mario.
2015 IEEE International Conference on Computer Vision (ICCV), 2015: 1-9
[57] Massiceti, Daniela. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[58] Mei, H. Proc. AAAI Conference on Artificial Intelligence, 2016.
[59] Human-level control through deep reinforcement learning [J].
Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A.; Veness, Joel; Bellemare, Marc G.; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K.; Ostrovski, Georg; Petersen, Stig; Beattie, Charles; Sadik, Amir; Antonoglou, Ioannis; King, Helen; Kumaran, Dharshan; Wierstra, Daan; Legg, Shane; Hassabis, Demis.
Nature, 2015, 518 (7540): 529-533
[60] Mostafazadeh, Nasrin. IJCNLP, 2017.