Time-frequency recurrent transformer with diversity constraint for dense video captioning

Cited by: 9
Authors
Li, Ping [1 ,2 ]
Zhang, Pan [1 ]
Wang, Tao [1 ]
Xiao, Huaxin [3 ]
Affiliations
[1] Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou, Peoples R China
[2] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China
[3] Natl Univ Def Technol, Dept Syst Engn, Changsha, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Dense video captioning; Transformer; Diversity; Time-frequency domain;
DOI
10.1016/j.ipm.2022.103204
Chinese Library Classification (CLC)
TP [Automation Technology; Computer Technology];
Subject Classification Code
0812;
Abstract
Describing a long video with multiple sentences, i.e., dense video captioning, is a very challenging task. Existing methods overlook the important fact that actions of several tempos (a.k.a. frequencies) evolve over time in a video, and they do not handle the phrase-repetition issue well. Therefore, we propose a Time-Frequency recurrent Transformer with Diversity constraint (TFTD) for dense video captioning. Its basic idea is to develop a time-frequency memory module, which not only stores the history of past sentences and the corresponding video segments to capture temporal relations, but also models the motion dependency of action patterns at different frequencies. This helps produce more coherent sentences that better describe the video content. Moreover, we adopt the Determinantal Point Process (DPP) to design a diversity loss that is imposed on the objective function as a constraint, so that the generated sentences are diverse and less redundant. Extensive experiments on two benchmark datasets verify the superior performance of our approach; e.g., it achieves 11.36, 16.56, 26.16, and 3.77 in terms of BLEU@4, METEOR, CIDEr-D, and R@4, respectively, on ActivityNet Captions (20,000 videos). Furthermore, TFTD outperforms the most competitive alternative by 8.0% on ActivityNet Captions and 9.8% on YouCookII (2,000 videos) in terms of coherence in human evaluation.
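The DPP-based diversity constraint mentioned in the abstract can be illustrated with a small sketch. The PyTorch snippet below is a hypothetical, minimal formulation, not the paper's exact loss: it scores a set of generated-sentence embeddings by the log-determinant of a cosine-similarity kernel, which a determinantal point process makes large when the items are mutually dissimilar. The function name `dpp_diversity_loss` and the weighting shown in the usage comment are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpp_diversity_loss(sent_embs: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Hypothetical DPP-style diversity loss over generated-sentence embeddings.

    sent_embs: (N, d) tensor, one embedding per generated sentence.
    Returns a scalar that is small when the sentences are diverse
    (large kernel determinant) and large when they are redundant.
    """
    # L2-normalize so kernel entries are cosine similarities.
    x = F.normalize(sent_embs, dim=-1)
    # Similarity kernel K; a small ridge keeps it positive definite.
    K = x @ x.t() + eps * torch.eye(x.size(0), device=x.device)
    # A DPP assigns probability proportional to det(K); minimizing
    # -log det(K) therefore encourages mutually dissimilar sentences.
    return -torch.logdet(K)

# Illustrative usage: add the diversity term to a captioning objective.
# total_loss = caption_ce_loss + 0.1 * dpp_diversity_loss(sentence_embeddings)
```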
Pages: 17