Semantic Enhanced Video Captioning with Multi-feature Fusion

Cited by: 3

Authors:

Niu, Tian-Zi [1]

Dong, Shan-Shan [1]

Chen, Zhen-Duo [1]

Luo, Xin [1]

Guo, Shanqing [2]

Huang, Zi [3]

Xu, Xin-Shun [1]

Affiliations:

[1] Shandong Univ, Sch Software, Jinan 250101, Peoples R China

[2] Shandong Univ, Sch Cyber Sci & Technol, Qingdao 266237, Peoples R China

[3] Univ Queensland, Sch Informat Technol & Elect Engn, Brisbane, Australia

Source:

ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS | 2023 / Vol. 19 / No. 06

Funding:

National Natural Science Foundation of China;

Keywords:

Video captioning; semantic encoder; discrete selection; multi-feature fusion; NETWORK;

DOI:

10.1145/3588572

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Video captioning aims to automatically describe a video clip with informative sentences. At present, deep learning-based models have become the mainstream for this task and achieved competitive results on public datasets. Usually, these methods leverage different types of features to generate sentences, e.g., semantic information, 2D or 3D features. However, some methods only treat semantic information as a complement of visual representations and cannot fully exploit it; some of them ignore the relationship between different types of features. In addition, most of them select multiple frames of a video with an equally spaced sampling scheme, resulting in much redundant information. To address these issues, we present a novel video-captioning framework, Semantic Enhanced video captioning with Multi-feature Fusion, SEMF for short. It optimizes the use of different types of features from three aspects. First, a semantic encoder is designed to enhance meaningful semantic features through a semantic dictionary to boost performance. Second, a discrete selection module pays attention to important features and obtains different contexts at different steps to reduce feature redundancy. Finally, a multi-feature fusion module uses a novel relation-aware attention mechanism to separate the common and complementary components of different features to provide more effective visual features for the next step. Moreover, the entire framework can be trained in an end-to-end manner. Extensive experiments are conducted on Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSR-VTT) datasets. The results demonstrate that SEMF is able to achieve state-of-the-art results.

Pages: 21
