BiTransformer: augmenting semantic context in video captioning via bidirectional decoder

Cited by: 5
Authors
Zhong, Maosheng [1]
Zhang, Hao [1]
Wang, Yong [1]
Xiong, Hao [1]
Affiliations
[1] Jiangxi Normal Univ, 99 Ziyang Ave, Nanchang, Jiangxi, Peoples R China
Keywords
Video captioning; Bidirectional decoding; Transformer;
DOI
10.1007/s00138-022-01329-3
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video captioning is an important problem with many applications. It aims to generate natural-language descriptions of the content of a video. Most existing video captioning methods are based on deep encoder-decoder models, in particular attention-based models such as the Transformer. However, existing Transformer-based models may not fully exploit the semantic context: they use only the left-to-right context and ignore its right-to-left counterpart. In this paper, we introduce a bidirectional (forward-backward) decoder that exploits both the left-to-right and right-to-left context in a Transformer-based video captioning model. We call the resulting model the bidirectional Transformer (BiTransformer). Specifically, existing Transformer-based models connect the encoder to a forward decoder, which captures the left-to-right context; between the two, we plug in a backward decoder to capture the right-to-left context. Equipped with this bidirectional decoder, the model exploits the semantic context of videos more fully, resulting in better video captions. The effectiveness of our model is demonstrated on two benchmark datasets, MSVD and MSR-VTT, by comparison with state-of-the-art methods. In particular, in terms of the important evaluation metric CIDEr, the proposed model outperforms the state-of-the-art models with improvements of 1.2% on both datasets.
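The two context directions described in the abstract can be illustrated with a minimal sketch of how training sequences for a forward and a backward decoder might be prepared. This is not the authors' implementation: the special tokens `<bos>` and `<eos>`, the whitespace tokenization, and the function names `teacher_forcing_pair` and `bidirectional_pairs` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical special tokens; the abstract does not specify them.
BOS, EOS = "<bos>", "<eos>"


def teacher_forcing_pair(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Return (decoder_input, target) for one decoding direction.

    The decoder input is the sequence shifted right by one position,
    so position i is trained to predict token i from tokens before it.
    """
    return [BOS] + tokens, tokens + [EOS]


def bidirectional_pairs(caption: str):
    """Build training pairs for a forward and a backward decoder."""
    tokens = caption.split()  # assumed whitespace tokenization
    forward = teacher_forcing_pair(tokens)         # left-to-right context
    backward = teacher_forcing_pair(tokens[::-1])  # right-to-left context
    return forward, backward


forward, backward = bidirectional_pairs("a man is playing a guitar")
print("forward input :", forward[0])
print("forward target:", forward[1])
print("backward input :", backward[0])
print("backward target:", backward[1])
```

Under these assumptions, the backward decoder sees the same caption reversed, so each position is predicted from the words that follow it in the original sentence. How the backward decoder's output is passed on to the forward decoder is not stated in the abstract and is left out of this sketch.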
Pages: 9
Related papers
50 in total
  • [31] Traffic Scenario Understanding and Video Captioning via Guidance Attention Captioning Network
    Liu, Chunsheng
    Zhang, Xiao
    Chang, Faliang
    Li, Shuang
    Hao, Penghui
    Lu, Yansha
    Wang, Yinhai
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2024, 25 (05) : 3615 - 3627
  • [32] Multimodal Context Fusion Based Dense Video Captioning Algorithm
    Li, Meiqi
    Zhou, Ziwei
    ENGINEERING LETTERS, 2025, 33 (04) : 1061 - 1072
  • [33] Multi-level video captioning method based on semantic space
    Yao, Xiao
    Zeng, Yuanlin
    Gu, Min
    Yuan, Ruxi
    Li, Jie
    Ge, Junyi
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (28) : 72113 - 72130
  • [34] Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning
    Shi, Botian
    Ji, Lei
    Niu, Zhendong
    Duan, Nan
    Zhou, Ming
    Chen, Xilin
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020 : 4337 - 4345
  • [35] Global-Local Combined Semantic Generation Network for Video Captioning
    Mao, L.
    Gao, H.
    Yang, D.
    Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics, 2023, 35 (09) : 1374 - 1382
  • [36] Bidirectional Maximum Entropy Training With Word Co-Occurrence for Video Captioning
    Liu, Sheng
    Li, Annan
    Wang, Jiahao
    Wang, Yunhong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 4494 - 4507
  • [37] End-to-End Video Captioning Based on Multiview Semantic Alignment for Human-Machine Fusion
    Wu, Shuai
    Gao, Yubing
    Yang, Weidong
    Li, Hongkai
    Zhu, Guangyu
    IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, 2024, 22 : 4682 - 4690
  • [38] Stacked Multimodal Attention Network for Context-Aware Video Captioning
    Zheng, Yi
    Zhang, Yuejie
    Feng, Rui
    Zhang, Tao
    Fan, Weiguo
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (01) : 31 - 42
  • [39] MIVCN: Multimodal interaction video captioning network based on semantic association graph
    Wang, Ying
    Huang, Guoheng
    Lin, Yuming
    Yuan, Haoliang
    Pun, Chi-Man
    Ling, Wing-Kuen
    Cheng, Lianglun
    APPLIED INTELLIGENCE, 2022, 52 (05) : 5241 - 5260
  • [40] Semantic Embedding Guided Attention with Explicit Visual Feature Fusion for Video Captioning
    Dong, Shanshan
    Niu, Tianzi
    Luo, Xin
    Liu, Wu
    Xu, Xinshun
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (02)