Video captioning with global and local text attention

Cited by: 1
Authors
Peng, Yuqing [1 ]
Wang, Chenxi [1 ]
Pei, Yixin [1 ]
Li, Yingjun [1 ]
Affiliations
[1] Hebei Univ Technol, Sch Artificial Intelligence, Tianjin 300401, Peoples R China
Keywords
Video captioning; Global control; Local strengthening; Bidirectional; Aggregation;
DOI
10.1007/s00371-021-02294-0
CLC number
TP31 [Computer software];
Subject classification codes
081202 ; 0835 ;
Abstract
The task of video captioning is to generate a description that matches the video content, which places stringent requirements on both fine-grained video feature extraction and the language processing of the caption text. This paper proposes a new method that applies global control of the text and local strengthening during training, so that the context can be consulted while the text is generated. In addition, more attention is given to important words in the text, such as nouns and predicate verbs, which greatly improves the recognition of objects and yields more accurate prediction of actions in the video. Moreover, the authors adopt 2D and 3D multimodal feature extraction for video features, and better results are achieved through the fine-grained feature capture of global attention and the fusion of bidirectional time flow. The method obtains good results on both the MSR-VTT and MSVD datasets.
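The local-strengthening idea described in the abstract, giving extra attention to nouns and predicate verbs, can be sketched as a POS-based reweighting of token scores before normalization. This is a minimal illustration only; the function name, tag set, and boost factor below are assumptions, not the paper's actual implementation:

```python
import math

def local_strengthening_weights(tokens, pos_tags, boost=2.0):
    """Upweight nouns and verbs, then softmax-normalize the scores.

    tokens   -- list of caption words
    pos_tags -- parallel list of POS tags (e.g. "NOUN", "VERB", "DET")
    boost    -- raw score given to content words (hypothetical value)
    """
    assert len(tokens) == len(pos_tags)
    # Content words (nouns, predicate verbs) get a larger raw score.
    raw = [boost if tag in ("NOUN", "VERB") else 1.0 for tag in pos_tags]
    # Softmax normalization turns raw scores into attention weights.
    total = sum(math.exp(s) for s in raw)
    return [math.exp(s) / total for s in raw]

weights = local_strengthening_weights(
    ["a", "man", "rides", "a", "horse"],
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
)
```

Here "man", "rides", and "horse" receive higher normalized weight than the determiners, which mirrors the paper's stated goal of sharpening object recognition and action prediction.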
Pages: 4267-4278
Number of pages: 12
Related papers
50 records in total
  • [31] Multimodal-enhanced hierarchical attention network for video captioning
    Zhong, Maosheng
    Chen, Youde
    Zhang, Hao
    Xiong, Hao
    Wang, Zhixiang
    MULTIMEDIA SYSTEMS, 2023, 29 (05) : 2469 - 2482
  • [32] Hierarchical attention-based multimodal fusion for video captioning
    Wu, Chunlei
    Wei, Yiwei
    Chu, Xiaoliang
    Weichen, Sun
    Su, Fei
    Wang, Leiquan
    NEUROCOMPUTING, 2018, 315 : 362 - 370
  • [33] Attention-based Densely Connected LSTM for Video Captioning
    Zhu, Yongqing
    Jiang, Shuqiang
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 802 - 810
  • [35] Multimodal architecture for video captioning with memory networks and an attention mechanism
    Li, Wei
    Guo, Dashan
    Fang, Xiangzhong
    PATTERN RECOGNITION LETTERS, 2018, 105 : 23 - 29
  • [36] Syntax-Guided Hierarchical Attention Network for Video Captioning
    Deng, Jincan
    Li, Liang
    Zhang, Beichen
    Wang, Shuhui
    Zha, Zhengjun
    Huang, Qingming
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (02) : 880 - 892
  • [37] Video Captioning using Hierarchical Multi-Attention Model
    Xiao, Huanhou
    Shi, Jinglun
    ICAIP 2018: 2018 THE 2ND INTERNATIONAL CONFERENCE ON ADVANCES IN IMAGE PROCESSING, 2018, : 96 - 101
  • [38] Stacked Multimodal Attention Network for Context-Aware Video Captioning
    Zheng, Yi
    Zhang, Yuejie
    Feng, Rui
    Zhang, Tao
    Fan, Weiguo
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (01) : 31 - 42
  • [39] Relation-aware attention for video captioning via graph learning
    Tu, Yunbin
    Zhou, Chang
    Guo, Junjun
    Li, Huafeng
    Gao, Shengxiang
    Yu, Zhengtao
    PATTERN RECOGNITION, 2023, 136
  • [40] Video Captioning by Learning from Global Sentence and Looking Ahead
    Niu, Tian-Zi
    Chen, Zhen-Duo
    Luo, Xin
    Zhang, Peng-Fei
    Huang, Zi
    Xu, Xin-Shun
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (05)