Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions

Cited by: 12
Authors
Garcia, Noa [1 ]
Nakashima, Yuta [1 ]
Affiliations
[1] Osaka Univ, Suita, Osaka, Japan
Source
COMPUTER VISION - ECCV 2020, PT XVIII | 2020 / Vol. 12363
Keywords
Video question answering; Video description; Knowledge bases;
DOI
10.1007/978-3-030-58523-5_34
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
To understand movies, humans constantly reason over the dialogues and actions shown in specific scenes and relate them to the overall storyline already seen. Inspired by this behaviour, we design ROLL, a model for knowledge-based video story question answering that leverages three crucial aspects of movie understanding: dialog comprehension, scene reasoning, and storyline recalling. In ROLL, each of these tasks is in charge of extracting rich and diverse information by 1) processing scene dialogues, 2) generating unsupervised video scene descriptions, and 3) obtaining external knowledge in a weakly supervised fashion. To answer a given question correctly, the information generated by each cognitively inspired task is encoded via Transformers and fused through a modality weighting mechanism, which balances the information from the different sources. Exhaustive evaluation demonstrates the effectiveness of our approach, which yields a new state-of-the-art on two challenging video question answering datasets: KnowIT VQA and TVQA+.
Pages: 581-598
Number of pages: 18