BiTransformer: augmenting semantic context in video captioning via bidirectional decoder

Cited: 5
Authors
Zhong, Maosheng [1 ]
Zhang, Hao [1 ]
Wang, Yong [1 ]
Xiong, Hao [1 ]
Affiliations
[1] Jiangxi Normal Univ, 99 Ziyang Ave, Nanchang, Jiangxi, Peoples R China
Keywords
Video captioning; Bidirectional decoding; Transformer;
DOI
10.1007/s00138-022-01329-3
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Codes
081104; 0812; 0835; 1405;
Abstract
Video captioning is an important problem involved in many applications. It aims to generate descriptions of the content of a video. Most existing methods for video captioning are based on deep encoder-decoder models, particularly attention-based models (e.g., Transformer). However, existing Transformer-based models may not fully exploit the semantic context; that is, they use only the left-to-right style of context while ignoring the right-to-left counterpart. In this paper, we introduce a bidirectional (forward-backward) decoder to exploit both the left-to-right and right-to-left styles of context for the Transformer-based video captioning model. Our model is therefore called bidirectional Transformer (dubbed BiTransformer). Specifically, alongside the bridge between the encoder and the forward decoder (which captures the left-to-right context) used in existing Transformer-based models, we plug in a backward decoder to capture the right-to-left context. Equipped with such a bidirectional decoder, the semantic context of videos is more fully exploited, resulting in better video captions. The effectiveness of our model is demonstrated on two benchmark datasets, i.e., MSVD and MSR-VTT, via comparison with state-of-the-art methods. In particular, in terms of the important evaluation metric CIDEr, the proposed model outperforms the state-of-the-art models with improvements of 1.2% on both datasets.
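The abstract describes running a backward decoder in parallel with the usual forward decoder so that each caption position sees both left-to-right and right-to-left context. A minimal sketch of that idea using standard PyTorch modules is shown below; the class, parameter names, and the concatenation-based fusion are illustrative assumptions, not the authors' actual implementation.

```python
# Hedged sketch of a bidirectional (forward-backward) Transformer decoder.
# All module names and the cat-then-project fusion are assumptions for
# illustration; they do not reproduce the paper's exact architecture.
import torch
import torch.nn as nn

class BiDecoderSketch(nn.Module):
    def __init__(self, d_model=64, nhead=4, num_layers=2, vocab_size=100):
        super().__init__()
        # Forward decoder captures left-to-right context.
        self.fwd = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        # Backward decoder captures right-to-left context.
        self.bwd = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.out = nn.Linear(2 * d_model, vocab_size)

    def forward(self, tgt_emb, memory):
        # memory: encoded video features (batch, frames, d_model)
        # tgt_emb: embedded caption tokens (batch, seq_len, d_model)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt_emb.size(1))
        h_fwd = self.fwd(tgt_emb, memory, tgt_mask=causal)
        # Reverse token order so the backward decoder reads right-to-left,
        # then flip its outputs back so positions align with h_fwd.
        h_bwd = self.bwd(tgt_emb.flip(1), memory, tgt_mask=causal).flip(1)
        # Fuse both context directions before predicting the vocabulary.
        return self.out(torch.cat([h_fwd, h_bwd], dim=-1))

video = torch.randn(2, 8, 64)    # 2 clips, 8 frame-level features each
caption = torch.randn(2, 5, 64)  # 5 embedded caption positions
logits = BiDecoderSketch()(caption, video)
print(logits.shape)
```

Concatenating the two decoders' hidden states is only one plausible fusion choice; the paper's actual combination of forward and backward context may differ.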
Pages: 9
Related Papers
50 records total
  • [41] MIVCN: Multimodal interaction video captioning network based on semantic association graph
    Ying Wang
    Guoheng Huang
    Lin Yuming
    Haoliang Yuan
    Chi-Man Pun
    Wing-Kuen Ling
    Lianglun Cheng
    Applied Intelligence, 2022, 52 : 5241 - 5260
  • [42] Reconstruct and Represent Video Contents for Captioning via Reinforcement Learning
    Zhang, Wei
    Wang, Bairui
    Ma, Lin
    Liu, Wei
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2020, 42 (12) : 3088 - 3101
  • [43] Spatio-Temporal Graph-based Semantic Compositional Network for Video Captioning
    Li, Shun
    Zhang, Ze-Fan
    Ji, Yi
    Li, Ying
    Liu, Chun-Ping
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [44] Multi-Level Visual Representation with Semantic-Reinforced Learning for Video Captioning
    Dong, Chengbo
    Chen, Xinru
    Chen, Aozhu
    Hu, Fan
    Wang, Zihan
    Li, Xirong
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 4750 - 4754
  • [45] Video captioning based on dual learning via multiple reconstruction blocks
    Putra, Bahy Helmi Hartoyo
    Jeong, Cheol
    IMAGE AND VISION COMPUTING, 2024, 148
  • [46] Relation-aware attention for video captioning via graph learning
    Tu, Yunbin
    Zhou, Chang
    Guo, Junjun
    Li, Huafeng
    Gao, Shengxiang
    Yu, Zhengtao
    PATTERN RECOGNITION, 2023, 136
  • [47] Automatic Video Captioning via Multi-channel Sequential Encoding
    Zhang, Chenyang
    Tian, Yingli
    COMPUTER VISION - ECCV 2016 WORKSHOPS, PT II, 2016, 9914 : 146 - 161
  • [48] Understanding Objects in Video: Object-Oriented Video Captioning via Structured Trajectory and Adversarial Learning
    Zhu, Fangyi
    Hwang, Jenq-Neng
    Ma, Zhanyu
    Chen, Guang
    Guo, Jun
    IEEE ACCESS, 2020, 8 : 169146 - 169159
  • [49] BERTHA: Video Captioning Evaluation Via Transfer-Learned Human Assessment
    Lebron, Luis
    Graham, Yvette
    McGuinness, Kevin
    Kouramas, Konstantinos
    O'Connor, Noel E.
    LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 1566 - 1575
  • [50] Sports Video Captioning via Attentive Motion Representation and Group Relationship Modeling
    Qi, Mengshi
    Wang, Yunhong
    Li, Annan
    Luo, Jiebo
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (08) : 2617 - 2633