VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning

Cited: 0
Authors
Yamazaki, Kashu [1 ]
Vo, Khoa [1 ]
Truong, Quang Sang [1 ]
Raj, Bhiksha [2 ,3 ]
Le, Ngan [1 ]
Affiliations
[1] Univ Arkansas, AICV Lab, Fayetteville, AR 72701 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA USA
[3] Mohammed Bin Zayed Univ AI, Abu Dhabi, U Arab Emirates
Funding
National Science Foundation (NSF);
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video Paragraph Captioning aims to generate a multi-sentence description of an untrimmed video with multiple temporal event locations as a coherent story. Following the human perception process, where a scene is effectively understood by decomposing it into visual (e.g. human, animal) and non-visual components (e.g. action, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function to guarantee that the learnt embedding features are consistent with the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms previous state-of-the-art methods in terms of accuracy and diversity. The source code is made publicly available at: https://github.com/UARK-AICV/VLTinT.
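The abstract does not specify the exact form of the VL contrastive loss; a common choice for aligning paired visual and caption embeddings is a symmetric InfoNCE-style objective. The sketch below is a generic illustration under that assumption, not the paper's implementation; the function name, temperature value, and NumPy formulation are all illustrative.

```python
import numpy as np

def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss between paired
    video and caption embeddings (row i of each matrix is a matched pair).
    Generic sketch only, not the paper's exact VL loss."""
    # L2-normalize so dot products are cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature      # (N, N) similarity matrix
    labels = np.arange(len(v))          # positives lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # average over both directions: video->text and text->video
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Matched pairs pull together while mismatched pairs in the same batch act as negatives, so correctly aligned embeddings yield a lower loss than shuffled ones.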
Pages: 3081-3090
Page count: 10
Related Papers
50 records
  • [1] Enhanced-Memory Transformer for Coherent Paragraph Video Captioning
    Cardoso, Leonardo Vilela
    Guimaraes, Silvio Jamil F.
    Patrocinio Jr, Zenilton K. G.
    2021 IEEE 33RD INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2021), 2021, : 836 - 840
  • [2] Exploring adaptive attention in memory transformer applied to coherent video paragraph captioning
    Cardoso, Leonardo Vilela
    Guimaraes, Silvio Jamil F.
    Patrocinio Jr, Zenilton K. G.
    2022 IEEE EIGHTH INTERNATIONAL CONFERENCE ON MULTIMEDIA BIG DATA (BIGMM 2022), 2022, : 37 - 44
  • [3] MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning
    Lei, Jie
    Wang, Liwei
    Shen, Yelong
    Yu, Dong
    Berg, Tamara L.
    Bansal, Mohit
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 2603 - 2614
  • [4] Memory-enhanced hierarchical transformer for video paragraph captioning
    Zhang, Benhui
    Gao, Junyu
    Yuan, Yuan
    NEUROCOMPUTING, 2025, 615
  • [5] STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding
    Su, Rui
    Yu, Qian
    Xu, Dong
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 1513 - 1522
  • [6] Descriptive and Coherent Paragraph Generation for Image Paragraph Captioning Using Vision Transformer and Post-processing
    Vakada, Naveen
    Sekhar, C. Chandra
    ADVANCED CONCEPTS FOR INTELLIGENT VISION SYSTEMS, ACIVS 2023, 2023, 14124 : 40 - 52
  • [7] Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning
    Xu Yang
    Hanwang Zhang
    Chongyang Gao
    Jianfei Cai
    International Journal of Computer Vision, 2023, 131 : 82 - 100
  • [8] Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning
    Yang, Xu
    Zhang, Hanwang
    Gao, Chongyang
    Cai, Jianfei
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2023, 131 (01) : 82 - 100
  • [9] Accelerated masked transformer for dense video captioning
    Yu, Zhou
    Han, Nanjia
    NEUROCOMPUTING, 2021, 445 : 72 - 80
  • [10] Bidirectional transformer with knowledge graph for video captioning
    Zhong, Maosheng
    Chen, Youde
    Zhang, Hao
    Xiong, Hao
    Wang, Zhixiang
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (20) : 58309 - 58328