Motion Guided Region Message Passing for Video Captioning

被引:51
作者
Chen, Shaoxiang [1 ]
Jiang, Yu-Gang [1 ]
机构
[1] Fudan Univ, Sch Comp Sci, Shanghai Key Lab Intelligent Informat Proc, Shanghai Collaborat Innovat Ctr Intelligent Visua, Shanghai, Peoples R China
来源
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) | 2021年
基金
中国国家自然科学基金;
关键词
D O I
10.1109/ICCV48922.2021.00157
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Video captioning is an important vision task and has been intensively studied in the computer vision community. Existing methods that utilize the fine-grained spatial information have achieved significant improvements, however, they either rely on costly external object detectors or do not sufficiently model the spatial/temporal relations. In this paper, we aim at designing a spatial information extraction and aggregation method for video captioning without the need of external object detectors. For this purpose, we propose a Recurrent Region Attention module to better extract diverse spatial features, and by employing Motion-Guided Cross-frame Message Passing, our model is aware of the temporal structure and able to establish high-order relations among the diverse regions across frames. They jointly encourage information communication and produce compact and powerful video representations. Furthermore, an Adjusted Temporal Graph Decoder is proposed to flexibly update video features and model high-order temporal relations during decoding. Experimental results on three benchmark datasets: MSVD, MSR-VTT, and VATEX demonstrate that our proposed method can outperform state-of-the-art methods.
引用
收藏
页码:1523 / 1532
页数:10
相关论文
共 53 条
[1]   Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering [J].
Anderson, Peter ;
He, Xiaodong ;
Buehler, Chris ;
Teney, Damien ;
Johnson, Mark ;
Gould, Stephen ;
Zhang, Lei .
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, :6077-6086
[2]  
[Anonymous], 2017, CVPR, DOI DOI 10.1109/CVPR.2017.415
[3]  
[Anonymous], 2019, CVPR, DOI DOI 10.1109/CVPR.2019.01040
[4]  
[Anonymous], 2019, AAAI
[5]  
[Anonymous], 2017, ICML
[6]  
[Anonymous], 2019, CVPR, DOI DOI 10.1109/CVPR.2019.01277
[7]  
[Anonymous], 2019, CVPR, DOI DOI 10.1109/CVPR.2019.00854
[8]   Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset [J].
Carreira, Joao ;
Zisserman, Andrew .
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, :4724-4733
[9]  
Chen David L., 2011, P 49 ANN M ASS COMP
[10]  
Chen JW, 2019, AAAI CONF ARTIF INTE, P8167