Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering

Cited by: 210
Authors
Fan, Chenyou [1]
Zhang, Xiaofan [1]
Zhang, Shu [1]
Wang, Wensheng [1]
Zhang, Chi [1]
Huang, Heng [1, 2]
Affiliations
[1] JD.com, Beijing, People's Republic of China
[2] JD Digits, Beijing, People's Republic of China
Source
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019) | 2019
DOI
10.1109/CVPR.2019.00210
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we propose a novel end-to-end trainable Video Question Answering (VideoQA) framework with three major components: 1) a new heterogeneous memory that effectively learns global context information from appearance and motion features; 2) a redesigned question memory that helps understand the complex semantics of the question and highlights the queried subjects; and 3) a new multimodal fusion layer that performs multi-step reasoning by attending to relevant visual and textual hints with self-updated attention. Our VideoQA model first generates global context-aware visual and textual features by letting the current inputs interact with the memory contents. It then performs attentional fusion of the multimodal visual and textual representations to infer the correct answer. Multiple reasoning cycles can be run to iteratively refine the attention weights over the multimodal data and improve the final representation of the QA pair. Experimental results demonstrate that our approach achieves state-of-the-art performance on four VideoQA benchmark datasets.
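The multi-step attentional fusion described in the abstract can be illustrated with a toy sketch. This is not the paper's model: the dot-product scoring, the tanh update rule, the zero initial state, and the names `attend` and `multistep_fusion` are all assumptions chosen only to show how a fused state can re-weight visual and textual features over several reasoning cycles.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, features):
    """Score each feature vector against the query (dot product, an assumption)
    and return the attention weights plus the weighted sum of the features."""
    weights = softmax([dot(query, f) for f in features])
    dim = len(features[0])
    context = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return weights, context


def multistep_fusion(visual, textual, steps=3):
    """Hypothetical multi-step fusion: the fused state attends to both
    modalities, then is updated from the two attended contexts."""
    dim = len(visual[0])
    state = [0.0] * dim  # assumed initial state; yields uniform attention at step 1
    history = []
    for _ in range(steps):
        v_weights, v_ctx = attend(state, visual)
        t_weights, t_ctx = attend(state, textual)
        # Assumed update rule: squash the sum of the old state and both contexts.
        state = [math.tanh(s + v + t) for s, v, t in zip(state, v_ctx, t_ctx)]
        history.append((v_weights, t_weights))
    return state, history


# Toy inputs: two visual feature vectors and two textual feature vectors.
visual = [[1.0, 0.0], [0.0, 1.0]]
textual = [[1.0, 0.0], [0.5, 0.5]]
state, history = multistep_fusion(visual, textual, steps=3)
```

In this toy setup the first cycle attends uniformly because the state starts at zero, and later cycles shift weight toward the features that agree with the accumulated state, which is the general idea behind iteratively refined attention.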
Pages: 1999-2007
Number of pages: 9