BiTransformer: augmenting semantic context in video captioning via bidirectional decoder

Cited: 5
Authors
Zhong, Maosheng [1 ]
Zhang, Hao [1 ]
Wang, Yong [1 ]
Xiong, Hao [1 ]
Affiliations
[1] Jiangxi Normal Univ, 99 Ziyang Ave, Nanchang, Jiangxi, Peoples R China
Keywords
Video captioning; Bidirectional decoding; Transformer;
DOI
10.1007/s00138-022-01329-3
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Codes
081104; 0812; 0835; 1405;
Abstract
Video captioning is an important problem involved in many applications. It aims to generate descriptions of the content of a video. Most existing methods for video captioning are based on deep encoder-decoder models, particularly attention-based models (e.g., Transformer). However, existing Transformer-based models may not fully exploit the semantic context; that is, they use only the left-to-right style of context while ignoring the right-to-left counterpart. In this paper, we introduce a bidirectional (forward-backward) decoder to exploit both the left-to-right and right-to-left styles of context for the Transformer-based video captioning model. Our model is therefore called bidirectional Transformer (dubbed BiTransformer). Specifically, alongside the bridge between the encoder and the forward decoder (which captures the left-to-right context) used in existing Transformer-based models, we plug in a backward decoder to capture the right-to-left context. Equipped with such a bidirectional decoder, the semantic context of videos is more fully exploited, resulting in better video captions. The effectiveness of our model is demonstrated on two benchmark datasets, i.e., MSVD and MSR-VTT, via comparison with state-of-the-art methods. In particular, in terms of the important evaluation metric CIDEr, the proposed model outperforms the state-of-the-art models with improvements of 1.2% on both datasets.
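The abstract describes running a backward decoder in parallel with the usual forward decoder so that each caption position sees both left-to-right and right-to-left context. A minimal sketch of that idea using standard PyTorch modules is shown below; the class, parameter names, and the concatenation-based fusion are illustrative assumptions, not the authors' actual implementation.

```python
# Hedged sketch of a bidirectional (forward-backward) Transformer decoder.
# All module names and the cat-then-project fusion are assumptions for
# illustration; they do not reproduce the paper's exact architecture.
import torch
import torch.nn as nn

class BiDecoderSketch(nn.Module):
    def __init__(self, d_model=64, nhead=4, num_layers=2, vocab_size=100):
        super().__init__()
        # Forward decoder captures left-to-right context.
        self.fwd = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        # Backward decoder captures right-to-left context.
        self.bwd = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.out = nn.Linear(2 * d_model, vocab_size)

    def forward(self, tgt_emb, memory):
        # memory: encoded video features (batch, frames, d_model)
        # tgt_emb: embedded caption tokens (batch, seq_len, d_model)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt_emb.size(1))
        h_fwd = self.fwd(tgt_emb, memory, tgt_mask=causal)
        # Reverse token order so the backward decoder reads right-to-left,
        # then flip its outputs back so positions align with h_fwd.
        h_bwd = self.bwd(tgt_emb.flip(1), memory, tgt_mask=causal).flip(1)
        # Fuse both context directions before predicting the vocabulary.
        return self.out(torch.cat([h_fwd, h_bwd], dim=-1))

video = torch.randn(2, 8, 64)    # 2 clips, 8 frame-level features each
caption = torch.randn(2, 5, 64)  # 5 embedded caption positions
logits = BiDecoderSketch()(caption, video)
print(logits.shape)
```

Concatenating the two decoders' hidden states is only one plausible fusion choice; the paper's actual combination of forward and backward context may differ.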
Pages: 9
Related Papers
50 records total
  • [41] MIVCN: Multimodal interaction video captioning network based on semantic association graph
    Ying Wang
    Guoheng Huang
    Lin Yuming
    Haoliang Yuan
    Chi-Man Pun
    Wing-Kuen Ling
    Lianglun Cheng
    Applied Intelligence, 2022, 52 : 5241 - 5260
  • [42] Reconstruct and Represent Video Contents for Captioning via Reinforcement Learning
    Zhang, Wei
    Wang, Bairui
    Ma, Lin
    Liu, Wei
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2020, 42 (12) : 3088 - 3101
  • [43] Spatio-Temporal Graph-based Semantic Compositional Network for Video Captioning
    Li, Shun
    Zhang, Ze-Fan
    Ji, Yi
    Li, Ying
    Liu, Chun-Ping
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [44] Multi-Level Visual Representation with Semantic-Reinforced Learning for Video Captioning
    Dong, Chengbo
    Chen, Xinru
    Chen, Aozhu
    Hu, Fan
    Wang, Zihan
    Li, Xirong
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 4750 - 4754
  • [45] Video captioning based on dual learning via multiple reconstruction blocks
    Putra, Bahy Helmi Hartoyo
    Jeong, Cheol
    IMAGE AND VISION COMPUTING, 2024, 148
  • [46] Relation-aware attention for video captioning via graph learning
    Tu, Yunbin
    Zhou, Chang
    Guo, Junjun
    Li, Huafeng
    Gao, Shengxiang
    Yu, Zhengtao
    PATTERN RECOGNITION, 2023, 136
  • [47] Automatic Video Captioning via Multi-channel Sequential Encoding
    Zhang, Chenyang
    Tian, Yingli
    COMPUTER VISION - ECCV 2016 WORKSHOPS, PT II, 2016, 9914 : 146 - 161
  • [48] Understanding Objects in Video: Object-Oriented Video Captioning via Structured Trajectory and Adversarial Learning
    Zhu, Fangyi
    Hwang, Jenq-Neng
    Ma, Zhanyu
    Chen, Guang
    Guo, Jun
    IEEE ACCESS, 2020, 8 : 169146 - 169159
  • [49] BERTHA: Video Captioning Evaluation Via Transfer-Learned Human Assessment
    Lebron, Luis
    Graham, Yvette
    McGuinness, Kevin
    Kouramas, Konstantinos
    O'Connor, Noel E.
    LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 1566 - 1575
  • [50] Sports Video Captioning via Attentive Motion Representation and Group Relationship Modeling
    Qi, Mengshi
    Wang, Yunhong
    Li, Annan
    Luo, Jiebo
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (08) : 2617 - 2633