Semantic Enhanced Video Captioning with Multi-feature Fusion

Cited by: 3
Authors
Niu, Tian-Zi [1]
Dong, Shan-Shan [1]
Chen, Zhen-Duo [1]
Luo, Xin [1]
Guo, Shanqing [2]
Huang, Zi [3]
Xu, Xin-Shun [1]
Affiliations
[1] Shandong Univ, Sch Software, Jinan 250101, Peoples R China
[2] Shandong Univ, Sch Cyber Sci & Technol, Qingdao 266237, Peoples R China
[3] Univ Queensland, Sch Informat Technol & Elect Engn, Brisbane, Australia
Funding
National Natural Science Foundation of China;
Keywords
Video captioning; semantic encoder; discrete selection; multi-feature fusion; NETWORK;
DOI
10.1145/3588572
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Video captioning aims to automatically describe a video clip with informative sentences. At present, deep learning-based models have become the mainstream for this task and achieved competitive results on public datasets. Usually, these methods leverage different types of features to generate sentences, e.g., semantic information, 2D or 3D features. However, some methods only treat semantic information as a complement of visual representations and cannot fully exploit it; some of them ignore the relationship between different types of features. In addition, most of them select multiple frames of a video with an equally spaced sampling scheme, resulting in much redundant information. To address these issues, we present a novel video-captioning framework, Semantic Enhanced video captioning with Multi-feature Fusion, SEMF for short. It optimizes the use of different types of features from three aspects. First, a semantic encoder is designed to enhance meaningful semantic features through a semantic dictionary to boost performance. Second, a discrete selection module pays attention to important features and obtains different contexts at different steps to reduce feature redundancy. Finally, a multi-feature fusion module uses a novel relation-aware attention mechanism to separate the common and complementary components of different features to provide more effective visual features for the next step. Moreover, the entire framework can be trained in an end-to-end manner. Extensive experiments are conducted on Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSR-VTT) datasets. The results demonstrate that SEMF is able to achieve state-of-the-art results.
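The abstract contrasts SEMF with the common practice of equally spaced frame sampling. As a minimal sketch of that baseline only (not of SEMF's discrete selection module, which this record does not describe in detail), the following shows one way to pick evenly spread frame indices. The function name `equally_spaced_indices` and the choice of sampling the centre of each segment are illustrative assumptions, not taken from the paper.

```python
from typing import List


def equally_spaced_indices(num_frames: int, num_samples: int) -> List[int]:
    """Return frame indices spread evenly over a clip (hypothetical helper)."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # A clip shorter than the sample budget simply contributes every frame.
    if num_samples >= num_frames:
        return list(range(num_frames))
    # Split the clip into num_samples equal-width segments and take the
    # centre frame of each one (an assumed convention; others exist).
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    print(equally_spaced_indices(100, 4))
```

Because the indices depend only on clip length, near-identical neighbouring frames are sampled as readily as informative ones, which is the redundancy the abstract refers to.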
Pages: 21