Time-frequency recurrent transformer with diversity constraint for dense video captioning

Cited by: 9
Authors
Li, Ping [1 ,2 ]
Zhang, Pan [1 ]
Wang, Tao [1 ]
Xiao, Huaxin [3 ]
Affiliations
[1] Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou, Peoples R China
[2] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China
[3] Natl Univ Def Technol, Dept Syst Engn, Changsha, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Dense video captioning; Transformer; Diversity; Time-frequency domain;
DOI
10.1016/j.ipm.2022.103204
Chinese Library Classification (CLC)
TP [Automation and Computer Technology];
Discipline classification code
0812;
Abstract
Describing a long video with multiple sentences, i.e., dense video captioning, is a very challenging task. Existing methods neglect the important fact that actions at several tempos (i.e., frequencies) evolve over time in a video, and they do not handle the phrase-repetition issue well. Therefore, we propose a Time-Frequency recurrent Transformer with Diversity constraint (TFTD) for dense video captioning. Its basic idea is to develop a time-frequency memory module that not only stores the history of past sentences and the corresponding video segments to capture their temporal relations, but also models the motion dependency of action patterns at different frequencies. This contributes to producing more coherent sentences that better describe the video content. Moreover, we adopt Determinantal Point Processes (DPP) to design a diversity loss that is imposed on the objective function as a constraint, so that the generated sentences are diverse and less redundant. Extensive experiments on two benchmark datasets verify the superior performance of our approach; e.g., it achieves 11.36, 16.56, 26.16, and 3.77 in terms of BLEU@4, METEOR, CIDEr-D, and R@4, respectively, on ActivityNet Captions (20,000 videos). In addition, TFTD outperforms the most competitive alternative by 8.0% on ActivityNet Captions and 9.8% on YouCookII (2,000 videos) in terms of coherence in human evaluation.
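To make the diversity constraint concrete, the sketch below shows one common way to realize a DPP-style diversity term: build a similarity kernel over the embeddings of the generated sentences and penalize a small determinant, which favors mutually dissimilar sentences. This is a minimal PyTorch illustration of the general technique, not the paper's exact formulation; the function name `dpp_diversity_loss`, the cosine kernel, the `eps` jitter, and the 0.1 weighting are assumptions made for the sketch.

```python
import torch
import torch.nn.functional as F

def dpp_diversity_loss(sentence_embeddings: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """DPP-style diversity penalty over N sentence embeddings of shape (N, D).

    Builds a cosine-similarity kernel L and returns -log det(L + eps*I).
    det(L) is large when the embeddings are near-orthogonal (diverse
    sentences) and shrinks toward zero when sentences repeat each other,
    so minimizing this term pushes the generated captions apart.
    """
    z = F.normalize(sentence_embeddings, dim=-1)   # unit-norm rows
    kernel = z @ z.t()                             # (N, N) Gram matrix, PSD
    n = kernel.size(0)
    kernel = kernel + eps * torch.eye(n, device=kernel.device)  # keep it positive definite
    return -torch.logdet(kernel)

if __name__ == "__main__":
    emb = torch.randn(5, 256)                # toy embeddings of 5 generated sentences
    caption_loss = torch.tensor(2.3)         # hypothetical captioning cross-entropy term
    total = caption_loss + 0.1 * dpp_diversity_loss(emb)  # assumed weighting
    print(float(total))
```

In practice such a term would be weighted against the captioning cross-entropy loss; the abstract does not specify the kernel or weighting that TFTD actually uses.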
Pages: 17
Related papers
50 records in total
  • [1] Accelerated masked transformer for dense video captioning
    Yu, Zhou
    Han, Nanjia
    NEUROCOMPUTING, 2021, 445 : 72 - 80
  • [2] Position embedding fusion on transformer for dense video captioning
    Yang, Sixuan
    Tang, Pengjie
    Wang, Hanli
    Li, Qinyu
    DEVELOPMENTS OF ARTIFICIAL INTELLIGENCE TECHNOLOGIES IN COMPUTATION AND ROBOTICS, 2020, 12 : 792 - 799
  • [3] Parallel Pathway Dense Video Captioning With Deformable Transformer
    Choi, Wangyu
    Chen, Jiasi
    Yoon, Jongwon
    IEEE ACCESS, 2022, 10 : 129899 - 129910
  • [4] End-to-End Dense Video Captioning with Masked Transformer
    Zhou, Luowei
    Zhou, Yingbo
    Corso, Jason J.
    Socher, Richard
    Xiong, Caiming
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 8739 - 8748
  • [5] Scenario-Aware Recurrent Transformer for Goal-Directed Video Captioning
    Man, Xin
    Ouyang, Deqiang
    Li, Xiangpeng
    Song, Jingkuan
    Shao, Jie
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2022, 18 (04)
  • [6] MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning
    Lei, Jie
    Wang, Liwei
    Shen, Yelong
    Yu, Dong
    Berg, Tamara L.
    Bansal, Mohit
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 2603 - 2614
  • [7] A Neural ODE and Transformer-based Model for Temporal Understanding and Dense Video Captioning
    Artham, Sainithin
    Shaikh, Soharab Hossain
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (23) : 64037 - 64056
  • [8] Hierarchical Time-Aware Summarization with an Adaptive Transformer for Video Captioning
    Cardoso, Leonardo Vilela
    Guimaraes, Silvio Jamil Ferzoli
    do Patrocinio Jr, Zenilton Kleber Goncalves
    INTERNATIONAL JOURNAL OF SEMANTIC COMPUTING, 2023, 17 (04) : 569 - 592
  • [9] Time-frequency vibration analysis of the transformer construction
    Kornatowski, Eugeniusz
PRZEGLAD ELEKTROTECHNICZNY, 2012, 88 (11B) : 268 - 271
  • [10] MODELING BEATS AND DOWNBEATS WITH A TIME-FREQUENCY TRANSFORMER
    Hung, Yun-Ning
    Wang, Ju-Chiang
    Song, Xuchen
    Lu, Wei-Tsung
    Won, Minz
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 401 - 405