Action-guided prompt tuning for video grounding

Cited by: 0
Authors
Wang, Jing [1 ]
Tsao, Raymon [2 ]
Wang, Xuan [1 ]
Wang, Xiaojie [1 ]
Feng, Fangxiang [1 ]
Tian, Shiyu [1 ]
Poria, Soujanya [3 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Xitucheng Rd 10, Beijing 100876, Peoples R China
[2] Peking Univ, 5 Yiheyuan Rd, Beijing 100871, Peoples R China
[3] Singapore Univ Technol & Design, Sch Informat Syst Technol & Design, 8 Somapah Rd, Singapore 487372, Singapore
Funding
National Natural Science Foundation of China;
Keywords
Video grounding; Multi-modal learning; Prompt tuning; Temporal information
DOI
10.1016/j.inffus.2024.102577
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video grounding aims to locate a moment of interest that semantically corresponds to a given query. We claim that existing methods overlook two critical issues: (1) the sparsity of language, and (2) the human perception process of events. Specifically, previous studies forcibly map the video and language modalities into a joint space for alignment, disregarding their inherent disparities. Verbs play a crucial role in queries, providing discriminative information for distinguishing different videos. However, in the video modality, actions, especially salient ones, are typically manifested across a larger number of frames, encompassing a richer reservoir of informative detail, whereas at the query level verbs are constrained to a single-word representation. This discrepancy reflects a significant sparsity in language features, making it suboptimal to naively map the two modalities into a shared space. Furthermore, segmenting ongoing activity into meaningful events is integral to human perception and contributes to event memory, yet preceding methods fail to account for this essential perceptual process. Considering these issues, we propose a novel Action-Guided Prompt Tuning (AGPT) method for video grounding. First, we design a Prompt Exploration module to explore latent expansion information for salient verbs in the language, thereby reducing language feature sparsity and facilitating cross-modal matching. Second, we design the auxiliary task of action temporal prediction for video grounding and introduce a temporal rank loss function to simulate the human perceptual system's segmentation of events, rendering AGPT temporal-aware. Our approach can be seamlessly integrated into any video grounding model with minimal additional parameters. Extensive ablation experiments on three backbones and three datasets demonstrate the superiority of our method.
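Illustrative sketch (not from the paper): the abstract mentions a temporal rank loss for the auxiliary action temporal prediction task but does not give its formulation. The snippet below shows one generic way a margin-based temporal ranking objective of this kind is often written in PyTorch; the function name temporal_rank_loss, the per-frame score and mask inputs, and the margin value are assumptions for illustration only, not the authors' method.

    import torch
    import torch.nn.functional as F

    def temporal_rank_loss(frame_scores, inside_mask, margin=0.2):
        # frame_scores: (T,) per-frame action scores from an auxiliary prediction head
        # inside_mask:  (T,) bool tensor, True for frames inside the annotated action span
        # Encourage every in-span frame to outscore every out-of-span frame by `margin`.
        pos = frame_scores[inside_mask]
        neg = frame_scores[~inside_mask]
        if pos.numel() == 0 or neg.numel() == 0:
            return frame_scores.new_zeros(())
        diff = pos.unsqueeze(1) - neg.unsqueeze(0)   # all (in-span, out-of-span) score pairs
        return F.relu(margin - diff).mean()          # pairwise hinge, averaged over pairs

    # Typical usage under these assumptions: add the auxiliary term to the grounding loss
    # with a small weight, e.g. total_loss = grounding_loss + 0.1 * temporal_rank_loss(scores, mask)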
Pages: 10
Related Papers
50 records in total
  • [11] Consistent Prompt Tuning for Generalized Category Discovery
    Yang, Muli
    Yin, Jie
    Gu, Yanan
    Deng, Cheng
    Zhang, Hanwang
    Zhu, Hongyuan
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025, : 4014 - 4041
  • [12] ASR MODEL ADAPTATION WITH DOMAIN PROMPT TUNING
    Zou, Pengpeng
    Ye, Jianhao
    Zhou, Hongbin
    2024 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, IALP 2024, 2024, : 406 - 410
  • [13] PTAU: Prompt Tuning for Attributing Unanswerable Questions
    Liao, Jinzhi
    Zhao, Xiang
    Zheng, Jianming
    Li, Xinyi
    Cai, Fei
    Tang, Jiuyang
    PROCEEDINGS OF THE 45TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '22), 2022, : 1219 - 1229
  • [14] Prompt Tuning in Code Intelligence: An Experimental Evaluation
    Wang, Chaozheng
    Yang, Yuanhang
    Gao, Cuiyun
    Peng, Yun
    Zhang, Hongyu
    Lyu, Michael R.
    IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2023, 49 (11) : 4869 - 4885
  • [15] PTR: Prompt Tuning with Rules for Text Classification
    Han, Xu
    Zhao, Weilin
    Ding, Ning
    Liu, Zhiyuan
    Sun, Maosong
    AI OPEN, 2022, 3 : 182 - 192
  • [16] No More Fine-Tuning? An Experimental Evaluation of Prompt Tuning in Code Intelligence
    Wang, Chaozheng
    Yang, Yuanhang
    Gao, Cuiyun
    Peng, Yun
    Zhang, Hongyu
    Lyu, Michael R.
    PROCEEDINGS OF THE 30TH ACM JOINT MEETING EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, ESEC/FSE 2022, 2022, : 382 - 394
  • [17] Query-Guided Refinement and Dynamic Spans Network for Video Highlight Detection and Temporal Grounding in Online Information Systems
    Xu, Yifang
    Sun, Yunzhuo
    Xie, Zien
    Zhai, Benxiang
    Jia, Youyao
    Du, Sidan
    INTERNATIONAL JOURNAL ON SEMANTIC WEB AND INFORMATION SYSTEMS, 2023, 19 (01)
  • [18] Learning Comprehensive Visual Grounding for Video Captioning
    Jiang, Wenhui
    Liu, Linxin
    Fang, Yuming
    Cheng, Yibo
    Peng, Yuxin
    Liu, Yang
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35 (04) : 3355 - 3367
  • [19] Modality-Consistent Prompt Tuning With Optimal Transport
    Ren, Hairui
    Tang, Fan
    Zheng, Huangjie
    Zhao, He
    Guo, Dandan
    Chang, Yi
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35 (03) : 2499 - 2512
  • [20] Black-Box Prompt Tuning With Subspace Learning
    Zheng, Yuanhang
    Tan, Zhixing
    Li, Peng
    Liu, Yang
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 3002 - 3013