BiTransformer: augmenting semantic context in video captioning via bidirectional decoder

Cited by: 5
Authors
Zhong, Maosheng [1 ]
Zhang, Hao [1 ]
Wang, Yong [1 ]
Xiong, Hao [1 ]
Affiliations
[1] Jiangxi Normal Univ, 99 Ziyang Ave, Nanchang, Jiangxi, Peoples R China
Keywords
Video captioning; Bidirectional decoding; Transformer
DOI
10.1007/s00138-022-01329-3
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Video captioning, an important problem in many applications, aims to generate descriptions of the content of a video. Most existing methods for video captioning are based on deep encoder-decoder models, particularly attention-based models such as the Transformer. However, existing Transformer-based models may not fully exploit the semantic context: they use only the left-to-right style of context and ignore its right-to-left counterpart. In this paper, we introduce a bidirectional (forward-backward) decoder that exploits both the left-to-right and right-to-left styles of context for Transformer-based video captioning; accordingly, our model is called the bidirectional Transformer (dubbed BiTransformer). Specifically, alongside the bridge between the encoder and the forward decoder (which captures the left-to-right context) used in existing Transformer-based models, we plug in a backward decoder to capture the right-to-left context. Equipped with such a bidirectional decoder, the semantic context of videos is exploited more fully, resulting in better video captions. The effectiveness of our model is demonstrated on two benchmark datasets, MSVD and MSR-VTT, via comparison with state-of-the-art methods. In particular, in terms of the important evaluation metric CIDEr, the proposed model outperforms the state-of-the-art models with an improvement of 1.2% on both datasets.
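Note: the abstract describes the forward-backward decoder only at a high level. The following is a minimal, illustrative PyTorch sketch of the general idea: a second decoder runs over the reversed caption so that the usual causal mask enforces right-to-left context, and its re-flipped states are fused with the forward decoder's output. All module names, dimensions, and the sum-based fusion are assumptions for illustration, not the authors' exact design.

```python
# Minimal, illustrative sketch of forward-backward (bidirectional)
# Transformer decoding for video captioning. Names, sizes, and the
# sum-based fusion are assumptions; the paper's exact architecture
# is not specified in the abstract.
import torch
import torch.nn as nn


class BiTransformerDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.fwd_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.bwd_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video_memory, captions):
        # video_memory: (B, T, d_model) features from the video encoder
        # captions:     (B, L) target-caption token ids (teacher forcing)
        L = captions.size(1)
        # Standard causal mask: position i may attend only to positions <= i.
        causal = torch.triu(
            torch.full((L, L), float("-inf"), device=captions.device),
            diagonal=1)
        # Forward decoder captures left-to-right context.
        fwd = self.fwd_decoder(self.embed(captions), video_memory,
                               tgt_mask=causal)
        # Backward decoder runs on the reversed caption, so the same causal
        # mask yields right-to-left context; flip its states back into place.
        bwd = self.bwd_decoder(self.embed(torch.flip(captions, dims=[1])),
                               video_memory, tgt_mask=causal)
        bwd = torch.flip(bwd, dims=[1])
        # Fuse both directions (a simple sum here; BiTransformer's actual
        # fusion may differ) and project to the vocabulary.
        return self.out(fwd + bwd)


if __name__ == "__main__":
    # Smoke test with random video features and caption ids.
    model = BiTransformerDecoder(vocab_size=10000)
    logits = model(torch.randn(2, 32, 512), torch.randint(0, 10000, (2, 12)))
    print(logits.shape)  # torch.Size([2, 12, 10000])
```

In schemes of this kind, the two decoders are usually trained with separate left-to-right and right-to-left objectives so that the backward branch does not leak future tokens into next-token prediction; the sketch above fuses the two directions only to illustrate how both styles of context can inform each output position.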
Pages: 9
Related Papers
50 records in total
  • [1] BiTransformer: augmenting semantic context in video captioning via bidirectional decoder
    Zhong, Maosheng
    Zhang, Hao
    Wang, Yong
    Xiong, Hao
    MACHINE VISION AND APPLICATIONS, 2022, 33
  • [2] Semantic Enhanced Encoder-Decoder Network (SEN) for Video Captioning
    Gui, Yuling
    Guo, Dan
    Zhao, Ye
    PROCEEDINGS OF THE 2ND WORKSHOP ON MULTIMEDIA FOR ACCESSIBLE HUMAN COMPUTER INTERFACES (MAHCI '19), 2019: 25-32
  • [3] Video Captioning with Semantic Guiding
    Yuan, Jin
    Tian, Chunna
    Zhang, Xiangnan
    Ding, Yuxuan
    Wei, Wei
    2018 IEEE FOURTH INTERNATIONAL CONFERENCE ON MULTIMEDIA BIG DATA (BIGMM), 2018
  • [4] Modeling Context-Guided Visual and Linguistic Semantic Feature for Video Captioning
    Sun, Zhixin
    Zhong, Xian
    Chen, Shuqin
    Liu, Wenxuan
    Feng, Duxiu
    Li, Lin
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2021, PT V, 2021, 12895: 677-689
  • [5] Memory-attended semantic context-aware network for video captioning
    Chen, Shuqin
    Zhong, Xian
    Wu, Shifeng
    Sun, Zhixin
    Liu, Wenxuan
    Jia, Xuemei
    Xia, Hongxia
    SOFT COMPUTING, 2021, 28 (Suppl 2): 425-425
  • [6] Video Captioning with Visual and Semantic Features
    Lee, Sujin
    Kim, Incheol
    JOURNAL OF INFORMATION PROCESSING SYSTEMS, 2018, 14 (06): 1318-1330
  • [7] Bidirectional transformer with knowledge graph for video captioning
    Zhong, Maosheng
    Chen, Youde
    Zhang, Hao
    Xiong, Hao
    Wang, Zhixiang
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (20): 58309-58328
  • [8] Discriminative Latent Semantic Graph for Video Captioning
    Bai, Yang
    Wang, Junyan
    Long, Yang
    Hu, Bingzhang
    Song, Yang
    Pagnucco, Maurice
    Guan, Yu
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021: 3556-3564
  • [9] Chained semantic generation network for video captioning
    Mao, L.
    Gao, H.
    Yang, D.
    Zhang, R.
    GUANGXUE JINGMI GONGCHENG/OPTICS AND PRECISION ENGINEERING, 2022, 30 (24): 3198-3209
  • [10] MULTIMODAL SEMANTIC ATTENTION NETWORK FOR VIDEO CAPTIONING
    Sun, Liang
    Li, Bing
    Yuan, Chunfeng
    Zha, Zhengjun
    Hu, Weiming
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019: 1300-1305