Visual Dialog

Cited by: 453
Authors
Das, Abhishek [1 ]
Kottur, Satwik [2 ]
Gupta, Khushi [2 ]
Singh, Avi [3 ]
Yadav, Deshraj [4 ]
Moura, Jose M. F. [2 ]
Parikh, Devi [1 ]
Batra, Dhruv [1 ]
Affiliations
[1] Georgia Inst Technol, Atlanta, GA 30332 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[3] Univ Calif Berkeley, Berkeley, CA USA
[4] Virginia Tech, Blacksburg, VA USA
Source
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017) | 2017
Funding
U.S. National Science Foundation;
Keywords
DOI
10.1109/CVPR.2017.121
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in the image, infer context from history, and answer the question accurately. Visual Dialog is disentangled enough from a specific downstream task to serve as a general test of machine intelligence, while being grounded in vision enough to allow objective evaluation of individual responses and benchmark progress. We develop a novel two-person chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial). VisDial contains 1 dialog (10 question-answer pairs) on ~140k images from the COCO dataset, for a total of ~1.4M dialog question-answer pairs. We introduce a family of neural encoder-decoder models for Visual Dialog with 3 encoders (Late Fusion, Hierarchical Recurrent Encoder, and Memory Network) and 2 decoders (generative and discriminative), which outperform a number of sophisticated baselines. We propose a retrieval-based evaluation protocol for Visual Dialog in which the AI agent is asked to sort a set of candidate answers and is evaluated on metrics such as the mean reciprocal rank of the human response. We quantify the gap between machine and human performance on the Visual Dialog task via human studies. Our dataset, code, and trained models will be released publicly at visualdialog.org. Putting it all together, we demonstrate the first 'visual chatbot'!
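The retrieval-based evaluation described above ranks a set of candidate answers and scores the model by where the human's answer lands in that ranking. A minimal sketch of the mean reciprocal rank (MRR) metric is below; the function name and the sample ranks are illustrative, not from the paper's released code.

```python
def mean_reciprocal_rank(ranks):
    """Mean reciprocal rank: the average of 1/rank over all questions,
    where `rank` is the 1-based position of the human answer in the
    model's sorted list of candidate answers. Higher is better; a
    perfect model (human answer always ranked first) scores 1.0."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical ranks of the human answer for three questions:
# ranked 1st, 4th, and 2nd among the candidate answers.
ranks = [1, 4, 2]
print(mean_reciprocal_rank(ranks))  # (1 + 0.25 + 0.5) / 3 ≈ 0.583
```

The reciprocal weighting is what makes this metric forgiving of near-misses: dropping from rank 1 to rank 2 costs 0.5, but dropping from rank 9 to rank 10 costs almost nothing.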
Pages: 1080-1089
Page count: 10