Semantic Enhanced Video Captioning with Multi-feature Fusion

Cited by: 3
Authors
Niu, Tian-Zi [1]
Dong, Shan-Shan [1]
Chen, Zhen-Duo [1]
Luo, Xin [1]
Guo, Shanqing [2]
Huang, Zi [3]
Xu, Xin-Shun [1]
Affiliations
[1] Shandong Univ, Sch Software, Jinan 250101, Peoples R China
[2] Shandong Univ, Sch Cyber Sci & Technol, Qingdao 266237, Peoples R China
[3] Univ Queensland, Sch Informat Technol & Elect Engn, Brisbane, Australia
Funding
National Natural Science Foundation of China
Keywords
Video captioning; semantic encoder; discrete selection; multi-feature fusion; NETWORK;
DOI
10.1145/3588572
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Video captioning aims to automatically describe a video clip with informative sentences. Deep learning-based models have become the mainstream approach to this task and achieve competitive results on public datasets. These methods usually leverage several types of features to generate sentences, e.g., semantic information and 2D or 3D visual features. However, some methods treat semantic information only as a complement to visual representations and cannot fully exploit it, while others ignore the relationships between different types of features. In addition, most of them select frames from a video with an equally spaced sampling scheme, which yields a large amount of redundant information. To address these issues, we present a novel video-captioning framework, Semantic Enhanced video captioning with Multi-feature Fusion, SEMF for short. It optimizes the use of different types of features in three respects. First, a semantic encoder enhances meaningful semantic features through a semantic dictionary to boost performance. Second, a discrete selection module attends to important features and obtains different contexts at different steps to reduce feature redundancy. Finally, a multi-feature fusion module uses a novel relation-aware attention mechanism to separate the common and complementary components of different features, providing more effective visual features for the next step. Moreover, the entire framework can be trained in an end-to-end manner. Extensive experiments are conducted on the Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSR-VTT) datasets. The results show that SEMF achieves state-of-the-art performance.
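For readers unfamiliar with the equally spaced sampling scheme that the abstract identifies as a source of redundancy, the following is a minimal sketch of that baseline only. It is not SEMF's discrete selection module, and the function name `equally_spaced_indices` and its parameters are hypothetical, chosen for illustration.

```python
def equally_spaced_indices(num_frames: int, num_samples: int) -> list[int]:
    """Return num_samples frame indices spread evenly over a clip.

    Hypothetical helper illustrating the baseline sampling scheme;
    it ignores frame content entirely, which is why near-duplicate
    frames can be selected from static segments of a video.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples == 1:
        # With a single sample, take the middle frame.
        return [num_frames // 2]
    last = num_frames - 1
    # Integer arithmetic keeps the first and last frames as endpoints.
    return [(i * last) // (num_samples - 1) for i in range(num_samples)]


indices = equally_spaced_indices(100, 5)
print(indices)
```

Because the indices depend only on the clip length, two clips of the same length always yield the same sampled positions regardless of what happens in them, which is the content-agnostic behavior the abstract contrasts with feature selection.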
Pages: 21
Related papers
50 in total
  • [21] High Speed Front-Vehicle Detection Based on Video Multi-feature Fusion
    Xiong, Liliang
    Yue, Wenjing
    Xu, Qiushi
    Zhu, Zhengtian
    Chen, Zhi
    PROCEEDINGS OF 2020 IEEE 10TH INTERNATIONAL CONFERENCE ON ELECTRONICS INFORMATION AND EMERGENCY COMMUNICATION (ICEIEC 2020), 2020, : 348 - 351
  • [22] Background Modeling Algorithm for Multi-feature Fusion
    Guo, Zhicheng
    Dang, Jianwu
    Wang, Yangping
    Jin, Jing
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 1117 - 1121
  • [23] Knowledge tracing based on multi-feature fusion
    Yongkang Xiao
    Rong Xiao
    Ning Huang
    Yixin Hu
    Huan Li
    Bo Sun
    Neural Computing and Applications, 2023, 35 : 1819 - 1833
  • [24] Knowledge tracing based on multi-feature fusion
    Xiao, Yongkang
    Xiao, Rong
    Huang, Ning
    Hu, Yixin
    Li, Huan
    Sun, Bo
    NEURAL COMPUTING & APPLICATIONS, 2023, 35 (02): : 1819 - 1833
  • [25] Multi-feature fusion target tracking algorithm
    Liang Hui-hui
    He Qiu-sheng
    Jia Wei-zhen
    Zhang Wei-feng
    CHINESE JOURNAL OF LIQUID CRYSTALS AND DISPLAYS, 2020, 35 (06) : 583 - 594
  • [26] MFNet: Multi-Feature Fusion Network for Real-Time Semantic Segmentation in Road Scenes
    Lu, Mengxu
    Chen, Zhenxue
    Liu, Chengyun
    Ma, Sile
    Cai, Lei
    Qin, Hao
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2022, 23 (11) : 20991 - 21003
  • [27] Multi-feature Fusion Based on Semantic Understanding Attention Neural Network for Chinese Text Categorization
    Xie Jinbao
    Hou Yongjin
    Kang Shouqiang
    Li Baiwei
    Zhang Xiao
    JOURNAL OF ELECTRONICS & INFORMATION TECHNOLOGY, 2018, 40 (05) : 1258 - 1265
  • [28] Research on Chinese Semantic Relation Extraction in Marine Engine Rooms Based on Multi-Feature Fusion
    Liu, Xicai
    Wang, Zhengquan
    Wang, Fubo
    IEEE ACCESS, 2024, 12 : 192013 - 192027
  • [29] Video Captioning with Semantic Guiding
    Yuan, Jin
    Tian, Chunna
    Zhang, Xiangnan
    Ding, Yuxuan
    Wei, Wei
    2018 IEEE FOURTH INTERNATIONAL CONFERENCE ON MULTIMEDIA BIG DATA (BIGMM), 2018,
  • [30] Semantic similarity information discrimination for video captioning
    Du, Sen
    Zhu, Hong
    Xiong, Ge
    Lin, Guangfeng
    Wang, Dong
    Shi, Jing
    Wang, Jing
    Xing, Nan
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 213