Bridging Video and Text: A Two-Step Polishing Transformer for Video Captioning

Cited by: 18
Authors
Xu, Wanru [1 ,2 ]
Miao, Zhenjiang [1 ,2 ]
Yu, Jian [3 ,4 ]
Tian, Yi [3 ,4 ]
Wan, Lili [1 ,2 ]
Ji, Qiang [5 ]
Affiliations
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing 100044, Peoples R China
[2] Beijing Key Lab Adv Informat Sci & Network Techno, Beijing 100044, Peoples R China
[3] Beijing Jiaotong Univ, Sch Comp & Informat Technol, Beijing 100044, Peoples R China
[4] Beijing Key Lab Traff Data Anal & Min, Beijing 100044, Peoples R China
[5] Rensselaer Polytech Inst, Dept Elect & Comp Engn, Troy, NY 12180 USA
Keywords
Semantics; Visualization; Decoding; Transformers; Task analysis; Planning; Training; Video captioning; transformer; polishing mechanism; cross-modal modeling; NETWORKS;
DOI
10.1109/TCSVT.2022.3165934
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline classification codes
0808 ; 0809 ;
Abstract
Video captioning is a joint task of computer vision and natural language processing that aims to describe video content in natural language sentences. Most existing methods cast this task as a mapping problem: they learn a mapping from visual features to natural language and generate captions directly from videos. However, the underlying challenge of video captioning, i.e., sequence-to-sequence mapping across different domains, is still not well handled. To address this problem, we introduce a polishing mechanism that mimics the human polishing process and propose a generate-and-polish framework for video captioning. Specifically, we propose a two-step transformer-based polishing network (TSTPN) consisting of two sub-modules: a generation module that produces a caption candidate and a polishing module that gradually refines the generated candidate. The candidate provides global information about the visual content in a semantically meaningful order. First, it serves as a semantic intermediary that bridges the semantic gap between text and video, working with a cross-modal attention mechanism for better cross-modal modeling. Second, it provides a global planning ability that maintains the semantic consistency and fluency of the whole sentence for better sequence mapping. In experiments, extensive evaluations show that the proposed TSTPN achieves performance comparable to, and in some cases better than, state-of-the-art methods on benchmark datasets.
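The generate-and-polish idea described in the abstract can be illustrated with a minimal sketch: a draft caption representation (from a hypothetical generation step) repeatedly attends to the video features and is refined. This is an illustrative toy example only, not the paper's TSTPN implementation; the function names, the residual update rule, and the feature dimensions are all assumptions made for demonstration.

```python
import numpy as np

def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention: caption queries attend to video features."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (num_words, num_frames)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over frames
    return weights @ values                         # video context per word

def generate_and_polish(video_feats, draft_embed, polish_steps=2):
    """Two-step scheme: a draft caption embedding (assumed to come from a
    generation module) is iteratively refined by re-attending to the video."""
    caption = draft_embed
    for _ in range(polish_steps):
        context = cross_modal_attention(caption, video_feats, video_feats)
        caption = 0.5 * (caption + context)  # toy residual "polish" update
    return caption
```

In the actual paper, both steps are transformer sub-modules and the polishing module conditions on the full candidate sentence, which is what supplies the global planning ability the abstract describes; the loop above only conveys the iterative-refinement control flow.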
Pages: 6293-6307
Number of pages: 15