Modeling Context-Guided Visual and Linguistic Semantic Feature for Video Captioning

Cited by: 2
Authors
Sun, Zhixin [1 ]
Zhong, Xian [1 ,2 ]
Chen, Shuqin [1 ]
Liu, Wenxuan [1 ]
Feng, Duxiu [3 ]
Li, Lin [1 ,2 ]
Affiliations
[1] Wuhan Univ Technol, Sch Comp Sci & Technol, Wuhan 430070, Peoples R China
[2] Wuhan Univ Technol, Hubei Key Lab Transportat Internet Things, Wuhan 430070, Peoples R China
[3] ZhongQianLiYuan Engn Consulting Co Ltd, Wuhan 430071, Peoples R China
Source
ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2021, PT V | 2021, Vol. 12895
Keywords
Video captioning; Visual semantic feature; Linguistic semantic feature; Semantic loss; Attention mechanism;
DOI
10.1007/978-3-030-86383-8_54
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Exploiting temporal visual features together with the corresponding descriptions has received increasing attention in video captioning. Most existing models generate caption words depending solely on the video's temporal structure, ignoring fine-grained information about the complete scene. Moreover, the traditional long short-term memory (LSTM) decoder used in recent models predicts the current word directly from the last generated hidden state, so the predicted word may be strongly tied to that single state rather than to the overall context. To model the temporal aspects of the activities typically shown in a video and to better capture long-range context, we propose a novel video captioning framework built on a context-guided semantic features model (CSF). Specifically, to maximize information flow, several steps of previous and future information are aggregated to guide the current token through a semantic loss, in the encoding and decoding phases respectively. The visual and linguistic features are corrected by fusing this surrounding information. Extensive experiments on the MSVD and MSR-VTT video captioning datasets demonstrate the effectiveness of our method compared with state-of-the-art approaches.
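The core idea in the abstract can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the window size, the mean-fusion rule, and the cosine-distance form of the "semantic loss" are assumptions made for clarity, standing in for the paper's context-guided aggregation of previous and future features.

```python
# Illustrative sketch (assumptions, not the authors' code): each time
# step's feature is corrected by fusing a window of previous and future
# features, and a "semantic loss" penalizes disagreement between the
# current feature and its aggregated context.
import numpy as np

def aggregate_context(feats: np.ndarray, t: int, window: int = 2) -> np.ndarray:
    """Mean of up to `window` previous and `window` future features around step t."""
    lo, hi = max(0, t - window), min(len(feats), t + window + 1)
    idx = [i for i in range(lo, hi) if i != t]
    return feats[idx].mean(axis=0)

def semantic_loss(feats: np.ndarray, window: int = 2) -> float:
    """Average cosine distance between each feature and its aggregated context."""
    losses = []
    for t in range(len(feats)):
        ctx = aggregate_context(feats, t, window)
        denom = np.linalg.norm(feats[t]) * np.linalg.norm(ctx) + 1e-8
        losses.append(1.0 - feats[t] @ ctx / denom)
    return float(np.mean(losses))

def fuse(feats: np.ndarray, alpha: float = 0.5, window: int = 2) -> np.ndarray:
    """Correct each feature by blending it with its aggregated context."""
    ctx = np.stack([aggregate_context(feats, t, window) for t in range(len(feats))])
    return alpha * feats + (1.0 - alpha) * ctx
```

The same blend-with-context pattern would apply to both the visual features in the encoder and the hidden states in the decoder; in the actual model the aggregation weights would be learned (e.g. via attention) rather than a fixed mean.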
Pages: 677-689
Number of pages: 13