Attention-based LSTM with Semantic Consistency for Videos Captioning

Cited by: 45
Authors
Guo, Zhao [1]
Gao, Lianli [1]
Song, Jingkuan [2]
Xu, Xing [1]
Shao, Jie [1]
Shen, Heng Tao [1,3]
Affiliations
[1] Univ Elect Sci & Technol China, Chengdu, Peoples R China
[2] Columbia Univ, New York, NY USA
[3] Univ Queensland, Brisbane, Qld, Australia
Source
MM'16: PROCEEDINGS OF THE 2016 ACM MULTIMEDIA CONFERENCE | 2016
Keywords
Video Description; Attention Mechanism; Multimodal Embedding; LSTM; Semantic Consistence;
DOI
10.1145/2964284.2967242
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent progress in using Long Short-Term Memory (LSTM) for image description has motivated the exploration of its application to automatically describing video content with natural language sentences. By taking a video as a sequence of features, an LSTM model is trained on video-sentence pairs to learn the association between a video and a sentence. However, most existing methods compress an entire video shot or frame into a static representation, without an attention mechanism that allows the model to focus on salient features. Furthermore, most existing approaches model the translation error but ignore the correlations between sentence semantics and visual content. To tackle these issues, we propose a novel end-to-end framework named aLSTMs, an attention-based LSTM model with semantic consistency, to translate videos into natural sentences. This framework integrates an attention mechanism with LSTM to capture the salient structures of a video, and exploits the correlation between multi-modal representations to generate sentences with rich semantic content. More specifically, we first propose an attention mechanism that computes a dynamic weighted sum of local 2D Convolutional Neural Network (CNN) and 3D CNN representations. Then, an LSTM decoder takes these visual features at time t and the word-embedding feature at time t-1 to generate important words. Finally, we use multi-modal embedding to map the visual and sentence features into a joint space to guarantee the semantic consistency of the sentence description and the video's visual content. Experiments on the benchmark datasets demonstrate the superiority of our method over state-of-the-art baselines for video captioning in both BLEU and METEOR.
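The abstract outlines the model's structure: soft attention forms a dynamic weighted sum of 2D/3D CNN frame features, an LSTM decoder conditions each word on that attended feature and the previous word's embedding, and a multi-modal embedding term keeps the sentence and visual representations consistent. Below is a minimal PyTorch sketch of that pipeline, not the authors' implementation; all layer sizes, module names (e.g. ALSTMSketch, feat_dim, hid_dim), and the cosine-distance consistency term are illustrative assumptions.

```python
# Minimal sketch of the aLSTMs idea described in the abstract (not the authors'
# code): soft attention over per-frame 2D/3D CNN features, an LSTM decoder fed
# the attended feature at step t and the word embedding from step t-1, plus a
# joint-embedding loss that pulls video and sentence representations together.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ALSTMSketch(nn.Module):
    def __init__(self, feat_dim=4096, hid_dim=512, embed_dim=512, vocab_size=10000):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        self.att_feat = nn.Linear(feat_dim, hid_dim)   # project frame/clip features
        self.att_hid = nn.Linear(hid_dim, hid_dim)     # project previous decoder state
        self.att_score = nn.Linear(hid_dim, 1)         # scalar attention score per frame
        self.decoder = nn.LSTMCell(feat_dim + embed_dim, hid_dim)
        self.logits = nn.Linear(hid_dim, vocab_size)
        # projections into the joint space used for the semantic-consistency term
        self.vid_proj = nn.Linear(feat_dim, embed_dim)
        self.sent_proj = nn.Linear(hid_dim, embed_dim)

    def attend(self, feats, h):
        # feats: (B, T, feat_dim) concatenated 2D/3D CNN features; h: (B, hid_dim)
        scores = self.att_score(torch.tanh(self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)               # (B, T, 1) attention weights
        return (alpha * feats).sum(dim=1)              # dynamic weighted sum of features

    def forward(self, feats, captions):
        # captions: (B, L) word indices, L >= 2; teacher forcing during training
        B, L = captions.shape
        h = feats.new_zeros(B, self.decoder.hidden_size)
        c = feats.new_zeros(B, self.decoder.hidden_size)
        word_losses = []
        for t in range(1, L):
            ctx = self.attend(feats, h)                # attended visual feature at step t
            prev = self.word_emb(captions[:, t - 1])   # word embedding from step t-1
            h, c = self.decoder(torch.cat([ctx, prev], dim=1), (h, c))
            word_losses.append(F.cross_entropy(self.logits(h), captions[:, t]))
        # semantic consistency: align mean video feature and final sentence state
        v = F.normalize(self.vid_proj(feats.mean(dim=1)), dim=1)
        s = F.normalize(self.sent_proj(h), dim=1)
        embed_loss = (1 - (v * s).sum(dim=1)).mean()   # cosine-distance surrogate
        return torch.stack(word_losses).mean() + embed_loss
```

The sketch trains with teacher forcing (the ground-truth previous word feeds each step); at inference the decoder would instead feed back its own predicted word until an end token is produced.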
Pages: 357-361
Number of pages: 5