Action-guided prompt tuning for video grounding

Cited by: 0
Authors
Wang, Jing [1]
Tsao, Raymon [2]
Wang, Xuan [1]
Wang, Xiaojie [1]
Feng, Fangxiang [1]
Tian, Shiyu [1]
Poria, Soujanya [3]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Xitucheng Rd 10, Beijing 100876, Peoples R China
[2] Peking Univ, 5 Yiheyuan Rd, Beijing 100871, Peoples R China
[3] Singapore Univ Technol & Design, Sch Informat Syst Technol & Design, 8 Somapah Rd, Singapore 487372, Singapore
Funding
National Natural Science Foundation of China;
Keywords
Video grounding; Multi-modal learning; Prompt tuning; Temporal information;
DOI
10.1016/j.inffus.2024.102577
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Code
081104; 0812; 0835; 1405;
Abstract
Video grounding aims to locate a moment of interest that semantically corresponds to a given query. We claim that existing methods overlook two critical issues: (1) the sparsity of language, and (2) the human perception process of events. Specifically, previous studies forcibly map the video modality and the language modality into a joint space for alignment, disregarding their inherent disparities. Verbs play a crucial role in queries, providing discriminative information for distinguishing different videos. However, in the video modality, actions, especially salient ones, typically span many frames and therefore carry far richer detail. At the query level, verbs are constrained to a single-word representation, creating a disparity. This discrepancy highlights a significant sparsity in language features, which makes naively mapping the two modalities into a shared space suboptimal. Furthermore, segmenting ongoing activity into meaningful events is integral to human perception and contributes to event memory. Previous methods fail to account for this essential perception process. To address these issues, we propose a novel Action-Guided Prompt Tuning (AGPT) method for video grounding. First, we design a Prompt Exploration module to explore latent expansion information for salient verbs in the language, thereby reducing language feature sparsity and facilitating cross-modal matching. Second, we design an auxiliary task of action temporal prediction for video grounding and introduce a temporal rank loss function to simulate the human perceptual system's segmentation of events, making AGPT temporally aware. Our approach can be seamlessly integrated into any video grounding model with minimal additional parameters. Extensive ablation experiments on three backbones and three datasets demonstrate the superiority of our method.
Pages: 10