Adversarial Reinforcement Learning With Object-Scene Relational Graph for Video Captioning

Cited by: 13
Authors
Hua, Xia [1]
Wang, Xinqing [1]
Rui, Ting [1]
Shao, Faming [1]
Wang, Dong [1]
Affiliation
[1] Army Engn Univ PLA, Coll Field Engn, Nanjing 210007, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Semantics; Feature extraction; Visualization; Reinforcement learning; Convolution; Training; Trajectory; Adversarial reinforcement learning; graph neural network; scene relational graph; video captioning; video understanding;
DOI
10.1109/TIP.2022.3148868
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Existing video captioning methods usually ignore important fine-grained semantic attributes, video diversity, and the associations and motion states between objects within and across frames, so they do not adapt well to small-sample datasets. To address these problems, this paper proposes a novel video captioning model together with an adversarial reinforcement learning strategy. First, an object-scene relational graph model is built on an object detector and a scene segmenter to express association features, and the graph is encoded by a graph neural network to enrich the visual feature representation. Meanwhile, a trajectory-based feature representation model replaces the previous purely data-driven approach to extracting motion and attribute information, which allows object motion to be analyzed in the time domain and a connection to be established between visual content and language on small datasets. Finally, an adversarial reinforcement learning strategy with a multi-branch discriminator is designed to learn the relationship between visual content and the corresponding words, so that rich linguistic knowledge is integrated into the model. Experimental results on three standard datasets and one small-sample dataset show that the proposed method achieves state-of-the-art performance, and ablation experiments and visualization results verify the effectiveness of each proposed strategy.
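The abstract describes encoding an object-scene relational graph with a graph neural network to enrich visual features. As a rough illustration only, not the authors' implementation, the sketch below shows one mean-aggregation graph layer over detector/segmenter node features in PyTorch; the class name, aggregation rule, graph layout, and all dimensions are assumptions chosen to convey the idea.

```python
import torch
import torch.nn as nn

class RelationalGraphEncoder(nn.Module):
    """Minimal single-layer graph encoder (illustrative sketch): each node
    (an object or scene region) averages the features of its neighbors via
    a normalized adjacency matrix, then passes them through a learned
    projection, loosely mirroring the idea of enriching visual features
    with object-scene relations."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: (N, in_dim) per-node detector/segmenter features
        # adj: (N, N) binary relation matrix (1 if two nodes are related)
        adj = adj + torch.eye(adj.size(0))   # add self-loops
        deg = adj.sum(dim=1, keepdim=True)   # node degrees for normalization
        msg = (adj / deg) @ node_feats       # mean aggregation over neighbors
        return torch.relu(self.linear(msg))  # updated node features

# Toy usage: 4 object nodes plus 1 scene node, 512-d input features.
feats = torch.randn(5, 512)
adj = torch.zeros(5, 5)
adj[4, :4] = adj[:4, 4] = 1.0                # scene node linked to every object
encoder = RelationalGraphEncoder(512, 256)
print(encoder(feats, adj).shape)             # torch.Size([5, 256])
```

In the paper's full pipeline these enriched node features would feed a caption decoder trained with the adversarial reinforcement learning strategy; the sketch covers only the graph-encoding step.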
Pages: 2004-2016
Page count: 13