Visual Dialog

Cited by: 453
Authors
Das, Abhishek [1 ]
Kottur, Satwik [2 ]
Gupta, Khushi [2 ]
Singh, Avi [3 ]
Yadav, Deshraj [4 ]
Moura, Jose M. F. [2 ]
Parikh, Devi [1 ]
Batra, Dhruv [1 ]
Affiliations
[1] Georgia Inst Technol, Atlanta, GA 30332 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[3] Univ Calif Berkeley, Berkeley, CA USA
[4] Virginia Tech, Blacksburg, VA USA
Source
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017) | 2017
Funding
US National Science Foundation;
Keywords
DOI
10.1109/CVPR.2017.121
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in the image, infer context from history, and answer the question accurately. Visual Dialog is disentangled enough from a specific downstream task to serve as a general test of machine intelligence, while being grounded in vision enough to allow objective evaluation of individual responses and benchmarking of progress. We develop a novel two-person chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial). VisDial contains one dialog (10 question-answer pairs) on ~140k images from the COCO dataset, for a total of ~1.4M dialog question-answer pairs. We introduce a family of neural encoder-decoder models for Visual Dialog with 3 encoders (Late Fusion, Hierarchical Recurrent Encoder, and Memory Network) and 2 decoders (generative and discriminative), which outperform a number of sophisticated baselines. We propose a retrieval-based evaluation protocol for Visual Dialog in which the AI agent is asked to sort a set of candidate answers and is evaluated on metrics such as the mean reciprocal rank of the human response. We quantify the gap between machine and human performance on the Visual Dialog task via human studies. Our dataset, code, and trained models will be released publicly at visualdialog.org. Putting it all together, we demonstrate the first 'visual chatbot'!
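To make the retrieval-based evaluation protocol described in the abstract concrete, below is a minimal Python sketch of the mean-reciprocal-rank computation: the model scores every candidate answer for a question, candidates are sorted by score, and the 1-based rank of the human response contributes its reciprocal to the average. This is illustrative code, not the authors' released implementation; the function and variable names (mean_reciprocal_rank, scores, gt_indices) are hypothetical.

```python
from typing import List, Sequence

def mean_reciprocal_rank(scores: Sequence[Sequence[float]],
                         gt_indices: Sequence[int]) -> float:
    """Mean reciprocal rank of the human (ground-truth) answer.

    scores[i]     -- model scores for the candidate answers of question i
                     (VisDial provides 100 candidates per question)
    gt_indices[i] -- index of the human response among those candidates
    """
    reciprocal_ranks: List[float] = []
    for cand_scores, gt in zip(scores, gt_indices):
        # Sort candidate indices by descending model score.
        order = sorted(range(len(cand_scores)),
                       key=lambda j: cand_scores[j], reverse=True)
        rank = order.index(gt) + 1  # 1-based rank of the human answer
        reciprocal_ranks.append(1.0 / rank)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Toy usage: 2 questions with 4 candidate answers each.
scores = [[0.1, 0.9, 0.3, 0.2],  # human answer (index 1) ranked 1st -> RR = 1
          [0.5, 0.2, 0.4, 0.1]]  # human answer (index 2) ranked 2nd -> RR = 1/2
print(mean_reciprocal_rank(scores, [1, 2]))  # 0.75
```

Under this protocol a perfect model always ranks the human response first, giving an MRR of 1.0.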
Pages: 1080 - 1089
Page count: 10