Video Captioning with Visual and Semantic Features

Cited by: 5
Authors
Lee, Sujin [1 ]
Kim, Incheol [2 ]
Affiliations
[1] Kyonggi Univ, Dept Comp Sci, Grad Sch, Suwon, South Korea
[2] Kyonggi Univ, Dept Comp Sci, Suwon, South Korea
Source
JOURNAL OF INFORMATION PROCESSING SYSTEMS | 2018 / Vol. 14 / No. 6
Keywords
Attention-Based Caption Generation; Deep Neural Networks; Semantic Feature; Video Captioning;
DOI
10.3745/JIPS.02.0098
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Video captioning refers to the process of extracting features from a video and generating video captions using the extracted features. This paper introduces a deep neural network model and its learning method for effective video captioning. In this study, both visual features and semantic features that effectively express the video are used. The visual features of the video are extracted using convolutional neural networks such as C3D and ResNet, while the semantic features are extracted using a semantic feature extraction network proposed in this paper. Furthermore, an attention-based caption generation network is proposed for the effective generation of video captions from the extracted features. The performance and effectiveness of the proposed model are verified through various experiments on two large-scale video benchmarks, the Microsoft Video Description (MSVD) and the Microsoft Research Video-to-Text (MSR-VTT) datasets.
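The abstract describes an attention-based caption generator that fuses per-frame visual features (from CNNs such as C3D and ResNet) with a semantic feature vector. The PyTorch-style decoder below is a minimal sketch of that general scheme, conditioning an LSTM at each step on a word embedding, an attended visual context, and a global semantic vector. It is not the authors' implementation; all names, feature dimensions, and the additive-attention formulation are illustrative assumptions.

# Minimal sketch (not the paper's released code) of an attention-based
# caption decoder over per-frame visual features plus a global semantic
# feature vector. All sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionCaptionDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, sem_dim=300,
                 embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Additive (Bahdanau-style) attention over frame features.
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        # Each LSTM input concatenates the previous word embedding, the
        # attended visual context, and the fixed semantic feature vector.
        self.lstm = nn.LSTMCell(embed_dim + feat_dim + sem_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def attend(self, frame_feats, h):
        # frame_feats: (B, T, feat_dim); h: (B, hidden_dim)
        scores = self.att_out(torch.tanh(
            self.att_feat(frame_feats) + self.att_hid(h).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)        # (B, T, 1)
        return (alpha * frame_feats).sum(dim=1)     # (B, feat_dim)

    def forward(self, frame_feats, sem_feat, captions):
        # captions: (B, L) token ids; teacher forcing over L-1 steps.
        B, L = captions.shape
        h = frame_feats.new_zeros(B, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for t in range(L - 1):
            ctx = self.attend(frame_feats, h)
            x = torch.cat([self.embed(captions[:, t]), ctx, sem_feat], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)           # (B, L-1, vocab_size)

In training, the returned logits would be compared against captions[:, 1:] with cross-entropy; the sem_feat argument stands in for the output of the paper's semantic feature extraction network.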
Pages: 1318-1330
Page count: 13
Related Papers (50 total)
  • [31] Rich Visual and Language Representation with Complementary Semantics for Video Captioning
    Tang, Pengjie
    Wang, Hanli
    Li, Qinyu
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2019, 15 (02)
  • [32] Learning to enhance aerial video captioning with visual question answering
    Al Mehmadi, Shima M.
    Bazi, Yakoub
    Al Rahhal, Mohamad M.
    Zuair, Mansour
    INTERNATIONAL JOURNAL OF REMOTE SENSING, 2024, 45 (18) : 6395 - 6407
  • [33] MIVCN: Multimodal interaction video captioning network based on semantic association graph
    Wang, Ying
    Huang, Guoheng
    Lin, Yuming
    Yuan, Haoliang
    Pun, Chi-Man
    Ling, Wing-Kuen
    Cheng, Lianglun
    APPLIED INTELLIGENCE, 2022, 52 (05) : 5241 - 5260
  • [34] Memory-attended semantic context-aware network for video captioning
    Chen, Shuqin
    Zhong, Xian
    Wu, Shifeng
    Sun, Zhixin
    Liu, Wenxuan
    Jia, Xuemei
    Xia, Hongxia
    SOFT COMPUTING, 2024, 28 (Suppl 2) : 425 - 425
  • [36] Incorporating the Graph Representation of Video and Text into Video Captioning
    Lu, Min
    Li, Yuan
    2022 IEEE 34TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI), 2022 : 396 - 401
  • [37] Aligning Source Visual and Target Language Domains for Unpaired Video Captioning
    Liu, Fenglin
    Wu, Xian
    You, Chenyu
    Ge, Shen
    Zou, Yuexian
    Sun, Xu
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (12) : 9255 - 9268
  • [38] Video Captioning Based on C3D and Visual Elements
    Xiao H.
    Shi J.
    2018, South China University of Technology (46) : 88 - 95
  • [39] Multi-scale features with temporal information guidance for video captioning
    Zhao, Hong
    Chen, Zhiwen
    Yang, Yi
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 137
  • [40] Multimodal Deep Neural Network with Image Sequence Features for Video Captioning
    Oura, Soichiro
    Matsukawa, Tetsu
    Suzuki, Einoshin
    2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2018