Visual Dialog

Cited by: 22
Authors
Das, Abhishek [1]
Kottur, Satwik [2]
Gupta, Khushi [2]
Singh, Avi [3]
Yadav, Deshraj [1]
Lee, Stefan [1]
Moura, Jose M. F. [2]
Parikh, Devi [1,4]
Batra, Dhruv [1,4]
Affiliations
[1] Georgia Tech, Atlanta, GA 30332 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[3] Univ Calif Berkeley, Berkeley, CA 94720 USA
[4] Facebook AI Res, Menlo Park, CA USA
Keywords
Visual dialog; computer vision; natural language processing; machine learning; GAME; GO
DOI
10.1109/TPAMI.2018.2828437
CLC Classification Code
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in the image, infer context from the history, and answer the question accurately. Visual Dialog is disentangled enough from a specific downstream task so as to serve as a general test of machine intelligence, while being sufficiently grounded in vision to allow objective evaluation of individual responses and benchmark progress. We develop a novel two-person real-time chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial). VisDial v0.9 has been released and consists of ~1.2M dialog question-answer pairs from 10-round, human-human dialogs grounded in ~120k images from the COCO dataset. We introduce a family of neural encoder-decoder models for Visual Dialog with 3 encoders (Late Fusion, Hierarchical Recurrent Encoder, and Memory Network, optionally with attention over image features) and 2 decoders (generative and discriminative), which outperform a number of sophisticated baselines. We propose a retrieval-based evaluation protocol for Visual Dialog where the AI agent is asked to sort a set of candidate answers and is evaluated on metrics such as mean reciprocal rank and recall@k of the human response. We quantify the gap between machine and human performance on the Visual Dialog task via human studies. Putting it all together, we demonstrate the first 'visual chatbot'! Our dataset, code, pretrained models, and visual chatbot are available at https://visualdialog.org.
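As a rough illustration of the retrieval-based evaluation the abstract describes (rank a set of candidate answers, then score the rank of the human response), the minimal Python sketch below computes mean reciprocal rank and recall@k. The data layout, the `score_fn` interface, and the candidate-list contents are illustrative assumptions, not the paper's released evaluation code.

```python
# Minimal sketch of retrieval-based dialog evaluation (MRR, recall@k).
# Assumes each question comes with a list of candidate answers, exactly one
# of which is the human (ground-truth) response; all names are illustrative.

def evaluate_retrieval(score_fn, examples, ks=(1, 5, 10)):
    """score_fn(example) -> list of floats, one score per candidate answer.
    Each example dict has 'candidates' and 'gt_index' (index of the human
    response among the candidates)."""
    ranks = []
    for ex in examples:
        scores = score_fn(ex)
        # Rank of the human response: 1 + number of candidates scored higher.
        gt_score = scores[ex["gt_index"]]
        ranks.append(1 + sum(s > gt_score for s in scores))

    n = len(ranks)
    metrics = {"mrr": sum(1.0 / r for r in ranks) / n,
               "mean_rank": sum(ranks) / n}
    for k in ks:
        metrics[f"recall@{k}"] = sum(r <= k for r in ranks) / n
    return metrics


# Toy usage: one question with 5 candidates, human answer at index 2.
example = {"candidates": ["no", "yes", "two", "maybe", "red"], "gt_index": 2}
print(evaluate_retrieval(lambda ex: [0.1, 0.3, 0.9, 0.2, 0.05], [example]))
# {'mrr': 1.0, 'mean_rank': 1.0, 'recall@1': 1.0, 'recall@5': 1.0, 'recall@10': 1.0}
```

Because the human answer is scored highest in the toy example, its rank is 1 and every metric is perfect; a weaker model would push the rank down and the MRR toward 0.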
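The Late Fusion encoder named in the abstract encodes the image, the question, and the dialog history separately, concatenates the three vectors, and fuses them with a linear layer into a single embedding that a generative or discriminative decoder consumes. Below is a minimal PyTorch sketch under assumed dimensions; the layer sizes, module names, and use of pre-extracted CNN image features are my assumptions, not the released model code.

```python
# Minimal sketch of a Late Fusion encoder: pre-extracted CNN image features
# plus LSTM encodings of the question and (concatenated) dialog history,
# fused by one linear layer. All dimensions and names are illustrative.
import torch
import torch.nn as nn

class LateFusionEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, img_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.q_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.h_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Fuse concatenated (image, question, history) features into one vector.
        self.fusion = nn.Linear(img_dim + 2 * hidden_dim, hidden_dim)

    def forward(self, img_feat, question, history):
        # img_feat: (B, img_dim); question/history: (B, T) token ids.
        _, (q_state, _) = self.q_rnn(self.embed(question))
        _, (h_state, _) = self.h_rnn(self.embed(history))
        fused = torch.cat([img_feat, q_state[-1], h_state[-1]], dim=1)
        return torch.tanh(self.fusion(fused))  # (B, hidden_dim), fed to a decoder

enc = LateFusionEncoder(vocab_size=10000)
out = enc(torch.randn(2, 4096),
          torch.randint(0, 10000, (2, 12)),
          torch.randint(0, 10000, (2, 40)))
print(out.shape)  # torch.Size([2, 512])
```

"Late" fusion refers to combining the modalities only after each has been fully encoded, in contrast to the Hierarchical Recurrent Encoder and Memory Network variants, which condition on the history round by round.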
Pages: 1242-1256
Number of pages: 15