Multi-scale features with temporal information guidance for video captioning

Cited by: 2
Authors
Zhao, Hong [1 ]
Chen, Zhiwen [2 ]
Yang, Yi [2 ,3 ]
Affiliations
[1] Lanzhou Univ Technol, Sch Comp & Commun, Lanzhou 730050, Gansu, Peoples R China
[2] Lanzhou Univ, Sch Informat Sci Engn, Lanzhou 730000, Gansu, Peoples R China
[3] Key Lab Artificial Intelligence & Comp Power Techn, Lanzhou, Gansu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video captioning; Multi-scale features; Temporal information; Transformer; Gated decoding module;
DOI
10.1016/j.engappai.2024.109102
CLC number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Video captioning aims to automatically generate a textual description for a video; it is a challenging task that has drawn increasing attention recently. Although existing methods have achieved impressive performance, two challenging problems remain. (1) In the feature encoding stage, existing methods focus on either local features or global features to improve the accuracy or readability of the generated sentences, leaving useful information in the given video underutilized. (2) In the decoding stage, a vanilla Transformer is usually used to reason about visual relations and generate textual captions; it makes poor use of inter-frame temporal information, which leads to relation ambiguity and poor readability in the generated captions. To solve these problems, we propose a video captioning method based on multi-scale features with temporal information guidance. First, the pre-trained model CLIP is employed to extract video features. Second, global and local features are encoded separately to learn the overall and detailed information of the video and to construct multi-scale features. Finally, a gating unit is used to alleviate the inability of existing Transformer-based decoder modules to make good use of contextual temporal information. Extensive experiments on two publicly available datasets show that, compared with the best model among the comparison methods, the proposed model improves the BLEU, METEOR, ROUGE-L, and CIDEr metrics by 4.7%, 2.2%, 0.6%, and 2.0% on the MSR-VTT dataset, and by 5.1%, 9.0%, 5.8%, and 6.7% on the MSVD dataset, demonstrating that our method achieves more competitive performance.
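The abstract gives only a high-level description of the gated decoding module. As a rough illustration of the idea, the sketch below shows a hypothetical gating unit in PyTorch that blends a Transformer decoder's hidden states with an inter-frame temporal context via a learned sigmoid gate. All names and tensor shapes here (GatedTemporalFusion, d_model, temporal_ctx) are assumptions made for illustration, not the authors' released code.

import torch
import torch.nn as nn

class GatedTemporalFusion(nn.Module):
    # Hypothetical gating unit: blends decoder hidden states with a
    # temporal context vector using a learned sigmoid gate.
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, hidden: torch.Tensor, temporal_ctx: torch.Tensor) -> torch.Tensor:
        # hidden:       (batch, seq_len, d_model) Transformer decoder states
        # temporal_ctx: (batch, seq_len, d_model) inter-frame temporal context
        g = torch.sigmoid(self.gate(torch.cat([hidden, temporal_ctx], dim=-1)))
        # Convex combination: the gate decides, per dimension, how much
        # temporal context flows into each decoding step.
        return g * temporal_ctx + (1.0 - g) * hidden

# Usage: fuse decoder output with temporal context before the vocabulary projection.
fusion = GatedTemporalFusion(d_model=512)
hidden = torch.randn(2, 20, 512)    # decoder hidden states
context = torch.randn(2, 20, 512)   # e.g., pooled CLIP frame features
fused = fusion(hidden, context)     # (2, 20, 512)

One reason a sigmoid-weighted convex combination is a common choice here: when the temporal context is uninformative, the gate can saturate near zero and the module falls back to the plain decoder state, injecting auxiliary context without destabilizing training.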
Pages: 10
References
54 in total
[21]   Interaction augmented transformer with decoupled decoding for video captioning [J].
Jin, Tao ;
Zhao, Zhou ;
Wang, Peng ;
Yu, Jun ;
Wu, Fei .
NEUROCOMPUTING, 2022, 492 :496-507
[22]   Natural language description of human activities from video images based on concept hierarchy of actions [J].
Kojima, A ;
Tamura, T ;
Fukunaga, K .
INTERNATIONAL JOURNAL OF COMPUTER VISION, 2002, 50 (02) :171-184
[23]  
Lei J, 2020, arXiv, DOI arXiv:2005.05402
[24]   Video Captioning Based on Channel Soft Attention and Semantic Reconstructor [J].
Lei, Zhou ;
Huang, Yiyong .
FUTURE INTERNET, 2021, 13 (02) :1-18
[25]   Long Short-Term Relation Transformer With Global Gating for Video Captioning [J].
Li, Liang ;
Gao, Xingyu ;
Deng, Jincan ;
Tu, Yunbin ;
Zha, Zheng-Jun ;
Huang, Qingming .
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 :2726-2738
[26]   SWINBERT: End-to-End Transformers with Sparse Attention for Video Captioning [J].
Lin, Kevin ;
Li, Linjie ;
Lin, Chung-Ching ;
Ahmed, Faisal ;
Gan, Zhe ;
Liu, Zicheng ;
Lu, Yumao ;
Wang, Lijuan .
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, :17928-17937
[27]   Video Swin Transformer [J].
Liu, Ze ;
Ning, Jia ;
Cao, Yue ;
Wei, Yixuan ;
Zhang, Zheng ;
Lin, Stephen ;
Hu, Han .
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, :3192-3201
[28]   Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [J].
Liu, Ze ;
Lin, Yutong ;
Cao, Yue ;
Hu, Han ;
Wei, Yixuan ;
Zhang, Zheng ;
Lin, Stephen ;
Guo, Baining .
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, :9992-10002
[29]   CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning [J].
Luo, Huaishao ;
Ji, Lei ;
Zhong, Ming ;
Chen, Yang ;
Lei, Wen ;
Duan, Nan ;
Li, Tianrui .
NEUROCOMPUTING, 2022, 508 :293-304
[30]  
Wang Mingxing, 2020, 2020 IEEE 3rd International Conference on Computer and Communication Engineering Technology (CCET), P10, DOI 10.1109/CCET50901.2020.9213129