Towards Knowledge-Aware Video Captioning via Transitive Visual Relationship Detection

Cited by: 25
Authors
Wu, Bofeng [1 ]
Niu, Guocheng [2 ]
Yu, Jun [1 ]
Xiao, Xinyan [2 ]
Zhang, Jian [3 ]
Wu, Hua [2 ]
Affiliations
[1] Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou 310018, Peoples R China
[2] Baidu Inc, Beijing 100193, Peoples R China
[3] Zhejiang Int Studies Univ, Sch Int Business, Hangzhou 310023, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visualization; Task analysis; Semantics; Feature extraction; Decoding; Training; Vocabulary; Video captioning; multi-modal learning; computer vision; natural language processing
DOI
10.1109/TCSVT.2022.3169894
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline classification codes
0808; 0809
Abstract
Video captioning can be enhanced by incorporating knowledge, which is usually represented as relationships between objects. However, previous methods construct only superficial or static object relationships and often introduce noise into the task through irrelevant common sense or fixed syntax templates. These problems undermine model interpretability and lead to undesirable results. To overcome these limitations, we propose to enhance video captioning with deep-level object relationships that are adaptively explored during training. Specifically, we present a Transitive Visual Relationship Detection (TVRD) module in which we estimate the actions of the visual objects and construct an Object-Action Graph (OAG) to describe the shallow relationships between objects and actions. We then bridge the gap between objects via the actions to transitively infer an Object-Object Graph (OOG) that reflects the deep-level relationships. We further feed the OOG to a graph convolutional network to refine the object representations with these deep-level relationships. With the refined representations, we employ an LSTM-based decoder for caption generation. Experimental results on two benchmark datasets, MSVD and MSR-VTT, demonstrate that the proposed method achieves state-of-the-art performance. Finally, we present comprehensive ablation studies as well as visualizations of the visual relationships to demonstrate the effectiveness and interpretability of our model.
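
For illustration only, below is a minimal PyTorch sketch (not the authors' released code) of the transitive idea described in the abstract: object-action affinities (the OAG) are composed with their transpose to obtain an object-object graph (the OOG), which a single GCN-style layer then uses to refine the object features. All tensor shapes, layer widths, and names here are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TransitiveRelationSketch(nn.Module):
    """Illustrative sketch: OAG -> transitive OOG -> GCN-style refinement."""
    def __init__(self, obj_dim=256, act_dim=256, hidden_dim=128):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, hidden_dim)  # project object features
        self.act_proj = nn.Linear(act_dim, hidden_dim)  # project action features
        self.gcn = nn.Linear(hidden_dim, hidden_dim)    # single graph-conv-style layer

    def forward(self, obj_feats, act_feats):
        # obj_feats: (B, N_obj, obj_dim); act_feats: (B, N_act, act_dim)
        o = self.obj_proj(obj_feats)                               # (B, N_obj, H)
        a = self.act_proj(act_feats)                               # (B, N_act, H)
        # Object-Action Graph: affinity of every object to every action
        oag = F.softmax(torch.bmm(o, a.transpose(1, 2)), dim=-1)  # (B, N_obj, N_act)
        # Object-Object Graph: objects linked transitively through shared actions
        oog = F.softmax(torch.bmm(oag, oag.transpose(1, 2)), dim=-1)  # (B, N_obj, N_obj)
        # GCN-style refinement: aggregate neighbouring objects weighted by the OOG
        refined = F.relu(self.gcn(torch.bmm(oog, o)))              # (B, N_obj, H)
        return refined, oog

# Toy usage with made-up sizes: 5 objects and 3 actions per clip.
objs = torch.randn(2, 5, 256)
acts = torch.randn(2, 3, 256)
refined, oog = TransitiveRelationSketch()(objs, acts)
print(refined.shape, oog.shape)   # torch.Size([2, 5, 128]) torch.Size([2, 5, 5])

In the paper, the refined representations feed an LSTM-based caption decoder, which this sketch omits.
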
Pages: 6753-6765
Number of pages: 13
Related papers
19 records in total
  • [1] Image-Relevant Entities Knowledge-Aware News Image Captioning
    Ajankar, Sonali
    Dutta, Tanima
    IEEE MULTIMEDIA, 2024, 31 (01) : 88 - 98
  • [2] Visual Commonsense-Aware Representation Network for Video Captioning
    Zeng, Pengpeng
    Zhang, Haonan
    Gao, Lianli
    Li, Xiangpeng
    Qian, Jin
    Shen, Heng Tao
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2025, 36 (01) : 1092 - 1103
  • [3] Joint Lesion Detection and Classification of Breast Ultrasound Video via a Clinical Knowledge-Aware Framework
    Li, Minglei
    Gong, Wushuang
    Yan, Pengfei
    Li, Xiang
    Jiang, Yuchen
    Luo, Hao
    Zhou, Hang
    Yin, Shen
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35 (01) : 45 - 61
  • [4] Sports Video Captioning via Attentive Motion Representation and Group Relationship Modeling
    Qi, Mengshi
    Wang, Yunhong
    Li, Annan
    Luo, Jiebo
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (08) : 2617 - 2633
  • [5] Visual Relation-Aware Unsupervised Video Captioning
    Ji, Puzhao
    Cao, Meng
    Zou, Yuexian
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT III, 2022, 13531 : 495 - 507
  • [6] Enhancing Graph-Based Semisupervised Learning via Knowledge-Aware Data Embedding
    Ienco, Dino
    Pensa, Ruggero G.
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2020, 31 (11) : 5014 - 5020
  • [7] Towards Knowledge-Aware and Deep Reinforced Cross-Domain Recommendation Over Collaborative Knowledge Graph
    Li, Yakun
    Hou, Lei
    Li, Juanzi
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (11) : 7171 - 7187
  • [8] Relation-aware attention for video captioning via graph learning
    Tu, Yunbin
    Zhou, Chang
    Guo, Junjun
    Li, Huafeng
    Gao, Shengxiang
    Yu, Zhengtao
    PATTERN RECOGNITION, 2023, 136
  • [9] Towards fine-grained adaptive video captioning via Quality-Aware Recurrent Feedback Network
    Xu, Tianyang
    Zhang, Yunjie
    Song, Xiaoning
    Wu, Xiao-Jun
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 261
  • [10] Context-Aware Emotion Recognition Based on Visual Relationship Detection
    Hoang, Manh-Hung
    Kim, Soo-Hyung
    Yang, Hyung-Jeong
    Lee, Guee-Sang
    IEEE ACCESS, 2021, 9 : 90465 - 90474