BiTransformer: augmenting semantic context in video captioning via bidirectional decoder

Cited by: 5
Authors
Zhong, Maosheng [1]
Zhang, Hao [1]
Wang, Yong [1]
Xiong, Hao [1]
Affiliations
[1] Jiangxi Normal Univ, 99 Ziyang Ave, Nanchang, Jiangxi, People's Republic of China
Keywords
Video captioning; Bidirectional decoding; Transformer
DOI
10.1007/s00138-022-01329-3
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Video captioning is an important problem with many applications; it aims to generate descriptions of a video's content. Most existing methods for video captioning are based on deep encoder-decoder models, particularly attention-based models such as the Transformer. However, existing Transformer-based models may not fully exploit the semantic context: they use only the left-to-right context while ignoring its right-to-left counterpart. In this paper, we introduce a bidirectional (forward-backward) decoder that exploits both the left-to-right and right-to-left context for Transformer-based video captioning; we therefore call our model the bidirectional Transformer (dubbed BiTransformer). Specifically, alongside the forward decoder used in existing Transformer-based models, which bridges the encoder and captures the left-to-right context, we plug in a backward decoder to capture the right-to-left context. Equipped with such a bidirectional decoder, the semantic context of videos is more fully exploited, resulting in better video captions. The effectiveness of our model is demonstrated on two benchmark datasets, MSVD and MSR-VTT, via comparison with state-of-the-art methods. In particular, in terms of the important CIDEr evaluation metric, the proposed model outperforms the state-of-the-art models by 1.2% on both datasets.
Pages: 9
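
To make the bidirectional-decoder idea in the abstract concrete, below is a minimal PyTorch sketch pairing a forward (left-to-right) and a backward (right-to-left) Transformer decoder over the same encoded video features. It assumes teacher-forcing training; the class name, layer sizes, and the sum-based fusion of the two decoder streams are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

class BidirectionalCaptionDecoder(nn.Module):
    # Hypothetical module: forward + backward decoders over shared video features.
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.fwd_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.bwd_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video_memory, tokens):
        # video_memory: (B, T_v, d_model) encoder output over video frames
        # tokens:       (B, T_w) caption token ids (teacher forcing at train time)
        L = tokens.size(1)
        # Standard causal mask: each position attends only to earlier positions.
        causal = torch.triu(
            torch.full((L, L), float("-inf"), device=tokens.device), diagonal=1)
        x = self.embed(tokens)
        # Forward decoder reads the caption left-to-right.
        fwd = self.fwd_decoder(x, video_memory, tgt_mask=causal)
        # Backward decoder reads the reversed caption, i.e. right-to-left context;
        # flipping its output realigns it with the forward positions.
        bwd = self.bwd_decoder(x.flip([1]), video_memory, tgt_mask=causal).flip([1])
        # Fuse the two context streams (a simple sum here) before word prediction.
        return self.out(fwd + bwd)

# Toy usage: 2 videos, 16 frame features each, 7-token captions, vocab of 1000.
dec = BidirectionalCaptionDecoder(vocab_size=1000)
memory = torch.randn(2, 16, 512)
tokens = torch.randint(0, 1000, (2, 7))
logits = dec(memory, tokens)  # (2, 7, 1000)

At inference time a right-to-left decoder cannot see the future words of a caption still being generated, so bidirectional schemes typically rescore or refine a forward draft; that two-pass decoding step is omitted from this sketch.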