Spatio-Temporal Ranked-Attention Networks for Video Captioning

Cited by: 0

Authors:

Cherian, Anoop [1]

Wang, Jue [2]

Hori, Chiori [1]

Marks, Tim K. [1]

Affiliations:

[1] Mitsubishi Elect Res Labs, Cambridge, MA 02139 USA

[2] Australian Natl Univ, Canberra, ACT, Australia

Source:

2020 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV) | 2020

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Generating video descriptions automatically is a challenging task that involves a complex interplay between spatio-temporal visual features and language models. Given that videos consist of spatial (frame-level) features and their temporal evolutions, an effective captioning model should be able to attend to these different cues selectively. To this end, we propose a Spatio-Temporal and Temporo-Spatial (STaTS) attention model which, conditioned on the language state, hierarchically combines spatial and temporal attention to videos in two different orders: (i) a spatiotemporal (ST) sub-model, which first attends to regions that have temporal evolution, then temporally pools the features from these regions; and (ii) a temporo-spatial (TS) sub-model, which first decides a single frame to attend to, then applies spatial attention within that frame. We propose a novel LSTM-based temporal ranking function, which we call ranked attention, for the ST model to capture action dynamics. Our entire framework is trained end-to-end. We provide experiments on two benchmark datasets: MSVD and MSR-VTT. Our results demonstrate the synergy between the ST and TS modules, outperforming recent state-of-the-art methods.
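The two attention orders described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dot-product scoring, the softmax temporal pooling used in place of the LSTM-based ranked attention, the argmax frame choice, and the concatenation of the two outputs are all simplifying assumptions, and every function name is hypothetical. A video is assumed to be a nested list indexed as frames, then regions, then feature dimensions, and `state` stands in for the language state.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def spatial_attend(frame, state):
    # Attend over the regions of one frame, conditioned on the language state.
    weights = softmax([dot(region, state) for region in frame])
    return weighted_sum(weights, frame)

def st_attention(video, state):
    # ST order: spatial attention in every frame, then temporal pooling.
    # Assumption: softmax pooling stands in for the paper's ranked attention.
    per_frame = [spatial_attend(frame, state) for frame in video]
    temporal_weights = softmax([dot(f, state) for f in per_frame])
    return weighted_sum(temporal_weights, per_frame)

def ts_attention(video, state):
    # TS order: pick a single frame first, then attend spatially within it.
    # Assumption: frames are scored by their mean-pooled region features.
    frame_means = [
        weighted_sum([1.0 / len(frame)] * len(frame), frame) for frame in video
    ]
    temporal_weights = softmax([dot(m, state) for m in frame_means])
    best = max(range(len(video)), key=lambda t: temporal_weights[t])
    return best, spatial_attend(video[best], state)

def stats_context(video, state):
    # Assumption: the two sub-model outputs are simply concatenated.
    st = st_attention(video, state)
    _, ts = ts_attention(video, state)
    return st + ts

# Toy input: 2 frames, 2 regions per frame, 2-dimensional features.
video = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 0.5]]]
state = [1.0, 0.0]
context = stats_context(video, state)
```

In this toy input the second frame has the stronger response along the state direction, so the TS branch selects it, while the ST branch blends both frames. In the paper the context would feed a caption decoder at each word step; that part is outside this sketch.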

Citation

Pages: 1606-1615

Page count: 10
