Towards Knowledge-Aware Video Captioning via Transitive Visual Relationship Detection

Cited by: 25
Authors
Wu, Bofeng [1 ]
Niu, Guocheng [2 ]
Yu, Jun [1 ]
Xiao, Xinyan [2 ]
Zhang, Jian [3 ]
Wu, Hua [2 ]
Affiliations
[1] Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou 310018, Peoples R China
[2] Baidu Inc, Beijing 100193, Peoples R China
[3] Zhejiang Int Studies Univ, Sch Int Business, Hangzhou 310023, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visualization; Task analysis; Semantics; Feature extraction; Decoding; Training; Vocabulary; Video captioning; multi-modal learning; computer vision; natural language processing
DOI
10.1109/TCSVT.2022.3169894
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline classification codes
0808; 0809
Abstract
Video captioning can be enhanced by incorporating knowledge, which is usually represented as relationships between objects. However, previous methods construct only superficial or static object relationships and often introduce noise into the task through irrelevant common sense or fixed syntax templates. These problems undermine model interpretability and lead to undesirable results. To overcome these limitations, we propose to enhance video captioning with deep-level object relationships that are adaptively explored during training. Specifically, we present a Transitive Visual Relationship Detection (TVRD) module in which we estimate the actions of the visual objects and construct an Object-Action Graph (OAG) to describe the shallow relationships between objects and actions. We then bridge the gap between objects via the actions to transitively infer an Object-Object Graph (OOG) that reflects the deep-level relationships. We further feed the OOG to a graph convolutional network to refine the object representations with these deep-level relationships. With the refined representations, we employ an LSTM-based decoder for caption generation. Experimental results on two benchmark datasets, MSVD and MSR-VTT, demonstrate that the proposed method achieves state-of-the-art performance. Finally, we present comprehensive ablation studies as well as visualizations of the visual relationships to demonstrate the effectiveness and interpretability of our model.
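
For illustration only, below is a minimal PyTorch sketch (not the authors' released code) of the transitive idea described in the abstract: object-action affinities (the OAG) are composed with their transpose to obtain an object-object graph (the OOG), which a single GCN-style layer then uses to refine the object features. All tensor shapes, layer widths, and names here are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TransitiveRelationSketch(nn.Module):
    """Illustrative sketch: OAG -> transitive OOG -> GCN-style refinement."""
    def __init__(self, obj_dim=256, act_dim=256, hidden_dim=128):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, hidden_dim)  # project object features
        self.act_proj = nn.Linear(act_dim, hidden_dim)  # project action features
        self.gcn = nn.Linear(hidden_dim, hidden_dim)    # single graph-conv-style layer

    def forward(self, obj_feats, act_feats):
        # obj_feats: (B, N_obj, obj_dim); act_feats: (B, N_act, act_dim)
        o = self.obj_proj(obj_feats)                               # (B, N_obj, H)
        a = self.act_proj(act_feats)                               # (B, N_act, H)
        # Object-Action Graph: affinity of every object to every action
        oag = F.softmax(torch.bmm(o, a.transpose(1, 2)), dim=-1)  # (B, N_obj, N_act)
        # Object-Object Graph: objects linked transitively through shared actions
        oog = F.softmax(torch.bmm(oag, oag.transpose(1, 2)), dim=-1)  # (B, N_obj, N_obj)
        # GCN-style refinement: aggregate neighbouring objects weighted by the OOG
        refined = F.relu(self.gcn(torch.bmm(oog, o)))              # (B, N_obj, H)
        return refined, oog

# Toy usage with made-up sizes: 5 objects and 3 actions per clip.
objs = torch.randn(2, 5, 256)
acts = torch.randn(2, 3, 256)
refined, oog = TransitiveRelationSketch()(objs, acts)
print(refined.shape, oog.shape)   # torch.Size([2, 5, 128]) torch.Size([2, 5, 5])

In the paper, the refined representations feed an LSTM-based caption decoder, which this sketch omits.
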
Pages: 6753-6765
Number of pages: 13
Related papers
19 records in total
  • [1] Image-Relevant Entities Knowledge-Aware News Image Captioning
    Ajankar, Sonali
    Dutta, Tanima
    IEEE MULTIMEDIA, 2024, 31 (01) : 88 - 98
  • [2] Visual Commonsense-Aware Representation Network for Video Captioning
    Zeng, Pengpeng
    Zhang, Haonan
    Gao, Lianli
    Li, Xiangpeng
    Qian, Jin
    Shen, Heng Tao
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2025, 36 (01) : 1092 - 1103
  • [3] Joint Lesion Detection and Classification of Breast Ultrasound Video via a Clinical Knowledge-Aware Framework
    Li, Minglei
    Gong, Wushuang
    Yan, Pengfei
    Li, Xiang
    Jiang, Yuchen
    Luo, Hao
    Zhou, Hang
    Yin, Shen
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35 (01) : 45 - 61
  • [4] Sports Video Captioning via Attentive Motion Representation and Group Relationship Modeling
    Qi, Mengshi
    Wang, Yunhong
    Li, Annan
    Luo, Jiebo
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (08) : 2617 - 2633
  • [5] Visual Relation-Aware Unsupervised Video Captioning
    Ji, Puzhao
    Cao, Meng
    Zou, Yuexian
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT III, 2022, 13531 : 495 - 507
  • [6] Enhancing Graph-Based Semisupervised Learning via Knowledge-Aware Data Embedding
    Ienco, Dino
    Pensa, Ruggero G.
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2020, 31 (11) : 5014 - 5020
  • [7] Towards Knowledge-Aware and Deep Reinforced Cross-Domain Recommendation Over Collaborative Knowledge Graph
    Li, Yakun
    Hou, Lei
    Li, Juanzi
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (11) : 7171 - 7187
  • [8] Relation-aware attention for video captioning via graph learning
    Tu, Yunbin
    Zhou, Chang
    Guo, Junjun
    Li, Huafeng
    Gao, Shengxiang
    Yu, Zhengtao
    PATTERN RECOGNITION, 2023, 136
  • [9] Towards fine-grained adaptive video captioning via Quality-Aware Recurrent Feedback Network
    Xu, Tianyang
    Zhang, Yunjie
    Song, Xiaoning
    Wu, Xiao-Jun
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 261
  • [10] Context-Aware Emotion Recognition Based on Visual Relationship Detection
    Hoang, Manh-Hung
    Kim, Soo-Hyung
    Yang, Hyung-Jeong
    Lee, Guee-Sang
    IEEE ACCESS, 2021, 9 : 90465 - 90474