Adversarial Reinforcement Learning With Object-Scene Relational Graph for Video Captioning

Cited by: 13
Authors
Hua, Xia [1]
Wang, Xinqing [1]
Rui, Ting [1]
Shao, Faming [1]
Wang, Dong [1]
Affiliation
[1] Army Engn Univ PLA, Coll Field Engn, Nanjing 210007, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Semantics; Feature extraction; Visualization; Reinforcement learning; Convolution; Training; Trajectory; Adversarial reinforcement learning; graph neural network; scene relational graph; video captioning; video understanding;
DOI
10.1109/TIP.2022.3148868
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Existing video captioning methods usually ignore important fine-grained semantic attributes, video diversity, and the associations and motion states between objects within and across frames, so they do not adapt well to small-sample datasets. To address these problems, this paper proposes a novel video captioning model together with an adversarial reinforcement learning strategy. First, an object-scene relational graph model is built on an object detector and a scene segmenter to express association features, and the graph is encoded by a graph neural network to enrich the visual feature representation. Meanwhile, a trajectory-based feature representation model replaces the previous purely data-driven approach to extracting motion and attribute information, which allows object motion to be analyzed in the time domain and a connection to be established between visual content and language on small datasets. Finally, an adversarial reinforcement learning strategy with a multi-branch discriminator is designed to learn the relationship between visual content and the corresponding words, so that rich linguistic knowledge is integrated into the model. Experimental results on three standard datasets and one small-sample dataset show that the proposed method achieves state-of-the-art performance, and ablation experiments and visualization results verify the effectiveness of each proposed strategy.
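The abstract describes encoding an object-scene relational graph with a graph neural network to enrich visual features. As a rough illustration only, not the authors' implementation, the sketch below shows one mean-aggregation graph layer over detector/segmenter node features in PyTorch; the class name, aggregation rule, graph layout, and all dimensions are assumptions chosen to convey the idea.

```python
import torch
import torch.nn as nn

class RelationalGraphEncoder(nn.Module):
    """Minimal single-layer graph encoder (illustrative sketch): each node
    (an object or scene region) averages the features of its neighbors via
    a normalized adjacency matrix, then passes them through a learned
    projection, loosely mirroring the idea of enriching visual features
    with object-scene relations."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: (N, in_dim) per-node detector/segmenter features
        # adj: (N, N) binary relation matrix (1 if two nodes are related)
        adj = adj + torch.eye(adj.size(0))   # add self-loops
        deg = adj.sum(dim=1, keepdim=True)   # node degrees for normalization
        msg = (adj / deg) @ node_feats       # mean aggregation over neighbors
        return torch.relu(self.linear(msg))  # updated node features

# Toy usage: 4 object nodes plus 1 scene node, 512-d input features.
feats = torch.randn(5, 512)
adj = torch.zeros(5, 5)
adj[4, :4] = adj[:4, 4] = 1.0                # scene node linked to every object
encoder = RelationalGraphEncoder(512, 256)
print(encoder(feats, adj).shape)             # torch.Size([5, 256])
```

In the paper's full pipeline these enriched node features would feed a caption decoder trained with the adversarial reinforcement learning strategy; the sketch covers only the graph-encoding step.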
Pages: 2004-2016
Page count: 13