Global-Local Combined Semantic Generation Network for Video Captioning

Cited by: 0
|
Authors
Mao L. [1 ]
Gao H. [1 ]
Yang D. [1 ]
Affiliations
[1] College of Mechanical and Electronic Engineering, Dalian Minzu University, Dalian
Source
Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics | 2023, Vol. 35, No. 9
Keywords
multi-layer perceptron; residual structure; semantic features; video captioning; visual features;
DOI
10.3724/SP.J.1089.2023.19619
Abstract
To address the problem that semantic features in video captioning fail to capture both global general information and local detail information, which degrades captioning quality, a global-local combined semantic generation network (GLS-Net) for video captioning is proposed. First, exploiting the complementarity of global and local information, global and local semantic extraction units are designed; both units adopt a novel residual multi-layer perceptron (r-MLP) structure to enhance feature processing. Second, the algorithm combines general global semantics with detailed local semantics to strengthen the expressive power of the semantic features. Finally, the resulting features serve as the video content encoding to improve captioning performance. Experiments on the MSR-VTT and MSVD datasets, built on the semantics-assisted video captioning (SAVC) network, show that GLS-Net outperforms existing comparable algorithms, improving accuracy by 6.2% on average over the SAVC network. © 2023 Institute of Computing Technology. All rights reserved.
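The abstract only names the building blocks (an r-MLP structure and a global-local semantic combination), so the following is a minimal PyTorch sketch of how such a module could look. The hidden-layer widths, the use of mean pooling for the global branch and max pooling for the local branch, and the concatenation-based fusion are all assumptions for illustration, not details taken from the paper.

# Hypothetical sketch: r-MLP block and global-local semantic combination.
# Pooling choices, layer sizes, and fusion by concatenation are assumptions.
import torch
import torch.nn as nn


class ResidualMLP(nn.Module):
    """Two-layer perceptron with a skip connection (an r-MLP-style block)."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)  # residual connection around the MLP


class GlobalLocalSemantics(nn.Module):
    """Extracts global and local semantic vectors and combines them."""

    def __init__(self, feat_dim: int, sem_dim: int):
        super().__init__()
        self.global_unit = ResidualMLP(feat_dim, feat_dim * 2)
        self.local_unit = ResidualMLP(feat_dim, feat_dim * 2)
        self.project = nn.Linear(feat_dim * 2, sem_dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim) per-frame visual features
        global_sem = self.global_unit(frame_feats.mean(dim=1))        # video-level summary
        local_sem = self.local_unit(frame_feats).max(dim=1).values    # salient frame details
        combined = torch.cat([global_sem, local_sem], dim=-1)
        return self.project(combined)  # semantic code passed to the caption decoder


if __name__ == "__main__":
    feats = torch.randn(2, 26, 512)       # 2 videos, 26 frames, 512-d features
    model = GlobalLocalSemantics(512, 300)
    print(model(feats).shape)             # torch.Size([2, 300])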
Pages: 1374-1382
Page count: 8
References
25 in total
  • [1] Chen H R, Lin K, Maye A, et al., A semantics-assisted video captioning model trained with scheduled sampling, Frontiers in Robotics and AI, 7, (2020)
  • [2] Tu Y B, Zhang X S, Liu B T, et al., Video description with spatial-temporal attention, Proceedings of the 25th ACM International Conference on Multimedia, pp. 1014-1022, (2017)
  • [3] Zheng Q, Wang C Y, Tao D C., Syntax-aware action targeting for video captioning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 13093-13102, (2020)
  • [4] Zhang J C, Peng Y X., Video captioning with object-aware spatio-temporal correlation and aggregation, IEEE Transactions on Image Processing, 29, pp. 6209-6222, (2020)
  • [5] Yao L, Torabi A, Cho K, et al., Describing videos by exploiting temporal structure, Proceedings of the IEEE International Conference on Computer Vision, pp. 4507-4515, (2015)
  • [6] Zolfaghari M, Singh K, Brox T., ECO: efficient convolutional network for online video understanding, Proceedings of the European Conference on Computer Vision, pp. 713-730, (2018)
  • [7] Gan Z, Gan C, He X D, et al., Semantic compositional networks for visual captioning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1141-1150, (2017)
  • [8] Xu G H, Niu S C, Tan M K, et al., Towards accurate text-based image captioning with content diversity exploration, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12632-12641, (2021)
  • [9] Jiang W H, Zhu M W, Fang Y M, et al., Visual cluster grounding for image captioning, IEEE Transactions on Image Processing, 31, pp. 3920-3934, (2022)
  • [10] Venugopalan S, Rohrbach M, Donahue J, et al., Sequence to sequence - video to text, Proceedings of the IEEE International Conference on Computer Vision, pp. 4534-4542, (2015)