Video Captioning with Visual and Semantic Features

Cited by: 5
Authors
Lee, Sujin [1 ]
Kim, Incheol [2 ]
Affiliations
[1] Kyonggi Univ, Dept Comp Sci, Grad Sch, Suwon, South Korea
[2] Kyonggi Univ, Dept Comp Sci, Suwon, South Korea
Source
JOURNAL OF INFORMATION PROCESSING SYSTEMS, 2018, Vol. 14, No. 6
Keywords
Attention-Based Caption Generation; Deep Neural Networks; Semantic Feature; Video Captioning;
DOI
10.3745/JIPS.02.0098
CLC Number
TP [Automation technology, computer technology]
Discipline Code
0812
Abstract
Video captioning refers to the process of extracting features from a video and generating video captions using the extracted features. This paper introduces a deep neural network model and its learning method for effective video captioning. In this study, both visual and semantic features, which effectively express the video, are used. The visual features of the video are extracted using convolutional neural networks, such as C3D and ResNet, while the semantic features are extracted using a semantic feature extraction network proposed in this paper. Further, an attention-based caption generation network is proposed for effective generation of video captions using the extracted features. The performance and effectiveness of the proposed model are verified through various experiments on two large-scale video benchmarks: the Microsoft Video Description (MSVD) dataset and the Microsoft Research Video-To-Text (MSR-VTT) dataset.
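The abstract describes attending over extracted visual features and combining them with a semantic feature vector before caption generation. A minimal NumPy sketch of that fusion step is shown below; it is not the authors' code, and the dot-product attention, feature dimensions, and use of the semantic vector as the attention query are illustrative assumptions (the paper's actual network uses learned parameters).

```python
# Illustrative sketch only: attention-weighted fusion of per-frame visual
# features (e.g. from ResNet/C3D) with a clip-level semantic feature vector.
# Dimensions and the dot-product scoring are assumptions, not the paper's spec.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def attend(frame_feats, query):
    """Return an attention-weighted context vector over frame features.

    frame_feats: (T, D) array of per-frame visual features
    query:       (D,) vector used to score each frame's relevance
    """
    scores = frame_feats @ query   # (T,) relevance score per frame
    weights = softmax(scores)      # normalize into an attention distribution
    return weights @ frame_feats, weights  # (D,) context vector, (T,) weights

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 16))   # 8 frames, 16-dim visual features
semantic = rng.standard_normal(16)      # clip-level semantic feature vector

context, w = attend(frames, semantic)   # visual context guided by semantics
fused = np.concatenate([context, semantic])  # joint input to a caption decoder
```

At each decoding step, a caption-generation network would recompute `context` from its current hidden state and feed the fused vector into the word predictor.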
Pages: 1318-1330
Page count: 13
Related Papers
50 items total
  • [21] Fused GRU with semantic-temporal attention for video captioning
    Gao, Lianli
    Wang, Xuanhan
    Song, Jingkuan
    Liu, Yang
    NEUROCOMPUTING, 2020, 395 : 222 - 228
  • [22] Visual Relation-Aware Unsupervised Video Captioning
    Ji, Puzhao
    Cao, Meng
    Zou, Yuexian
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT III, 2022, 13531 : 495 - 507
  • [23] Semantic Enhanced Encoder-Decoder Network (SEN) for Video Captioning
    Gui, Yuling
    Guo, Dan
    Zhao, Ye
    PROCEEDINGS OF THE 2ND WORKSHOP ON MULTIMEDIA FOR ACCESSIBLE HUMAN COMPUTER INTERFACES (MAHCI '19), 2019, : 25 - 32
  • [25] Center-enhanced video captioning model with multimodal semantic alignment
    Zhang, Benhui
    Gao, Junyu
    Yuan, Yuan
    NEURAL NETWORKS, 2024, 180
  • [26] BiTransformer: augmenting semantic context in video captioning via bidirectional decoder
    Zhong, Maosheng
    Zhang, Hao
    Wang, Yong
    Xiong, Hao
    MACHINE VISION AND APPLICATIONS, 2022, 33 (05)
  • [27] Multi-level video captioning method based on semantic space
    Yao, Xiao
    Zeng, Yuanlin
    Gu, Min
    Yuan, Ruxi
    Li, Jie
    Ge, Junyi
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (28) : 72113 - 72130
  • [28] Global-Local Combined Semantic Generation Network for Video Captioning
    Mao L.
    Gao H.
    Yang D.
    Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics, 2023, 35 (09): : 1374 - 1382
  • [29] Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning
    Shi, Botian
    Ji, Lei
    Niu, Zhendong
    Duan, Nan
    Zhou, Ming
    Chen, Xilin
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 4337 - 4345
  • [30] Visual Commonsense-Aware Representation Network for Video Captioning
    Zeng, Pengpeng
    Zhang, Haonan
    Gao, Lianli
    Li, Xiangpeng
    Qian, Jin
    Shen, Heng Tao
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2025, 36 (01) : 1092 - 1103