Action-guided prompt tuning for video grounding

Cited by: 0
Authors
Wang, Jing [1]
Tsao, Raymon [2]
Wang, Xuan [1]
Wang, Xiaojie [1]
Feng, Fangxiang [1]
Tian, Shiyu [1]
Poria, Soujanya [3]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Xitucheng Rd 10, Beijing 100876, Peoples R China
[2] Peking Univ, 5 Yiheyuan Rd, Beijing 100871, Peoples R China
[3] Singapore Univ Technol & Design, Sch Informat Syst Technol & Design, 8 Somapah Rd, Singapore 487372, Singapore
Funding
National Natural Science Foundation of China
Keywords
Video grounding; Multi-modal learning; Prompt tuning; Temporal information
DOI
10.1016/j.inffus.2024.102577
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Video grounding aims to locate a moment of interest that semantically corresponds to a given query. We claim that existing methods overlook two critical issues: (1) the sparsity of language, and (2) the human perception process of events. Specifically, previous studies forcibly map the video modality and the language modality into a joint space for alignment, disregarding their inherent disparities. Verbs play a crucial role in queries, providing discriminative information for distinguishing different videos. However, in the video modality, actions, especially salient ones, are typically manifested across many frames and therefore carry far richer detail. At the query level, a verb is confined to a single word, creating a disparity. This discrepancy reveals significant sparsity in language features, which makes naively mapping the two modalities into a shared space suboptimal. Furthermore, segmenting ongoing activity into meaningful events is integral to human perception and contributes to event memory. Previous methods fail to account for this essential perception process. To address these issues, we propose a novel Action-Guided Prompt Tuning (AGPT) method for video grounding. First, we design a Prompt Exploration module to explore latent expansion information for the salient verbs in the query, thereby reducing language feature sparsity and facilitating cross-modal matching. Second, we design an auxiliary task of action temporal prediction for video grounding and introduce a temporal rank loss function to simulate the human perceptual system's segmentation of events, making AGPT temporally aware. Our approach can be seamlessly integrated into any video grounding model with minimal additional parameters. Extensive ablation experiments on three backbones and three datasets demonstrate the superiority of our method.
Pages: 10
Related papers
50 in total
  • [1] SPTNET: Span-based Prompt Tuning for Video Grounding
    Zhang, Yiren
    Xu, Yuanwu
    Chen, Mohan
    Zhang, Yuejie
    Feng, Rui
    Gao, Shang
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 2807 - 2812
  • [2] Video-Guided Curriculum Learning for Spoken Video Grounding
    Xia, Yan
    Zhao, Zhou
    Ye, Shangwei
    Zhao, Yang
    Li, Haoyuan
    Ren, Yi
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 5191 - 5200
  • [3] POET: Prompt Offset Tuning for Continual Human Action Adaptation
    Garg, Prachi
    Joseph, K. J.
    Balasubramanian, Vineeth N.
    Camgoz, Necati Cihan
    Wan, Chengde
    King, Kenrick
    Si, Weiguang
    Ma, Shugao
    De La Torre, Fernando
    COMPUTER VISION - ECCV 2024, PT LXIV, 2025, 15122 : 436 - 455
  • [4] UMP: Unified Modality-Aware Prompt Tuning for Text-Video Retrieval
    Zhang, Haonan
    Zeng, Pengpeng
    Gao, Lianli
    Song, Jingkuan
    Shen, Heng Tao
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (11) : 11954 - 11964
  • [5] When Adversarial Training Meets Prompt Tuning: Adversarial Dual Prompt Tuning for Unsupervised Domain Adaptation
    Cui, Chaoran
    Liu, Ziyi
    Gong, Shuai
    Zhu, Lei
    Zhang, Chunyun
    Liu, Hui
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2025, 34 : 1427 - 1440
  • [6] Enhancing Visual-Language Prompt Tuning Through Sparse Knowledge-Guided Context Optimization
    Tian, Qiangxing
    Zhang, Min
    ENTROPY, 2025, 27 (03)
  • [7] PTE: Prompt tuning with ensemble verbalizers
    Liang, Liheng
    Wang, Guancheng
    Lin, Cong
    Feng, Zhuowen
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 262
  • [8] Prompt Tuning in Biomedical Relation Extraction
    He, Jianping
    Li, Fang
    Li, Jianfu
    Hu, Xinyue
    Nian, Yi
    Xiang, Yang
    Wang, Jingqi
    Wei, Qiang
    Li, Yiming
    Xu, Hua
    Tao, Cui
    JOURNAL OF HEALTHCARE INFORMATICS RESEARCH, 2024, 8 (02) : 206 - 224
  • [10] G-Prompt: Graphon-based Prompt Tuning for graph classification
    Duan, Yutai
    Liu, Jie
    Chen, Shaowei
    Chen, Liyi
    Wu, Jianhua
    INFORMATION PROCESSING & MANAGEMENT, 2024, 61 (03)