VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning

Cited by: 0

Authors:

Yamazaki, Kashu [1]

Vo, Khoa [1]

Truong, Quang Sang [1]

Raj, Bhiksha [2, 3]

Le, Ngan [1]

Affiliations:

[1] Univ Arkansas, AICV Lab, Fayetteville, AR 72701 USA

[2] Carnegie Mellon Univ, Pittsburgh, PA USA

[3] Mohammed Bin Zayed Univ AI, Abu Dhabi, U Arab Emirates

Source:

THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3 | 2023

Funding:

National Science Foundation (USA);

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video Paragraph Captioning aims to generate a multi-sentence description of an untrimmed video with multiple temporal event locations in a coherent storytelling manner. Following the human perception process, where the scene is effectively understood by decomposing it into visual (e.g., human, animal) and non-visual components (e.g., action, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function to guarantee that the learnt embedding features are consistent with the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms previous state-of-the-art methods in terms of accuracy and diversity. The source code is made publicly available at: https://github.com/UARK-AICV/VLTinT.


Pages: 3081-3090

Number of pages: 10
