Motion-Appearance Co-Memory Networks for Video Question Answering

Cited by: 186
Authors
Gao, Jiyang [1]
Ge, Runzhou [1]
Chen, Kan [1]
Nevatia, Ram [1]
Affiliations
[1] Univ Southern Calif, Los Angeles, CA 90089 USA
Source
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2018
DOI
10.1109/CVPR.2018.00688
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video Question Answering (QA) is an important task in understanding video temporal structure. We observe that video QA has three unique attributes compared with image QA: (1) it deals with long sequences of images that contain richer information, not only in quantity but also in variety; (2) motion and appearance information are usually correlated and can provide useful attention cues for each other; (3) different questions require different numbers of frames to infer the answer. Based on these observations, we propose a motion-appearance co-memory network for video QA. Our network builds on concepts from the Dynamic Memory Network (DMN) and introduces new mechanisms for video QA. Specifically, there are three salient aspects: (1) a co-memory attention mechanism that utilizes cues from both motion and appearance to generate attention; (2) a temporal conv-deconv network to generate multi-level contextual facts; (3) a dynamic fact ensemble method to construct temporal representations dynamically for different questions. We evaluate our method on the TGIF-QA dataset, and the results significantly outperform the state of the art on all four tasks of TGIF-QA.
Pages: 6576-6585
Page count: 10