Training-Free Video Temporal Grounding Using Large-Scale Pre-trained Models

Cited by: 0

Authors:

Zheng, Minghang [1]

Cai, Xinhao [1]

Chen, Qingchao [2]

Peng, Yuxin [1]

Liu, Yang [1,3]

Affiliations:

[1] Peking Univ, Wangxuan Inst Comp Technol, Beijing, Peoples R China

[2] Peking Univ, Natl Inst Hlth Data Sci, Beijing, Peoples R China

[3] Peking Univ, State Key Lab Gen Artificial Intelligence, Beijing, Peoples R China

Source:

COMPUTER VISION-ECCV 2024, PT LXXXII | 2025 / Vol. 15140

Funding:

National Natural Science Foundation of China;

Keywords:

Video Temporal Grounding; Zero-shot Learning; Large Language Model; Vision Language Model;

DOI:

10.1007/978-3-031-73007-8_2

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video temporal grounding aims to identify the segments of an untrimmed video that are most relevant to a given natural language query. Existing video temporal localization models are trained on specific datasets, which are costly to collect, and they generalize poorly in cross-dataset and out-of-distribution (OOD) settings. In this paper, we propose a Training-Free Video Temporal Grounding (TFVTG) approach that leverages the capabilities of large pre-trained models. A naive baseline is to enumerate proposals in the video and use pre-trained vision-language models (VLMs) to select the best proposal according to vision-language alignment. However, most existing VLMs are trained on image-text pairs or trimmed video clip-text pairs, so they struggle to (1) grasp the relationships among multiple events within the same video and distinguish their temporal boundaries; and (2) comprehend, and remain sensitive to, the dynamic transition of events (the transition from one event to another) in the video. To address these issues, first, we propose leveraging large language models (LLMs) to analyze the multiple sub-events contained in the query text, together with the temporal order and relationships between these events. Second, we split each sub-event into a dynamic transition part and a static status part and propose dynamic and static scoring functions that use VLMs to better evaluate the relevance between the event and the description. Finally, for each sub-event description provided by the LLMs, we use VLMs to locate the top-k proposals most relevant to the description, and we use the order and relationships between sub-events provided by the LLMs to filter and integrate these proposals. Our method achieves the best performance on zero-shot video temporal grounding on the Charades-STA and ActivityNet Captions datasets without any training and demonstrates better generalization capabilities in cross-dataset and OOD settings.
Code is available at https://github.com/minghangz/TFVTG.
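
The proposal-enumeration baseline and the order-based filtering step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (see the repository above for that). It assumes that per-frame query-to-frame similarity scores have already been computed by some VLM, and it uses a hypothetical scoring rule (mean similarity inside a window minus mean similarity outside it) and a hypothetical "first sub-event ends before the second begins" constraint standing in for the relationships an LLM would provide. All function names here are illustrative.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

# A proposal is (start_frame, end_frame_exclusive, score).
Proposal = Tuple[int, int, float]


def score_proposals(sims: Sequence[float], min_len: int = 1) -> List[Proposal]:
    """Enumerate every window [s, e) over the frames and score it.

    Assumed scoring rule (not from the paper): mean similarity inside the
    window minus mean similarity outside the window.
    """
    n = len(sims)
    prefix = [0.0]
    for value in sims:
        prefix.append(prefix[-1] + value)
    total = prefix[-1]

    proposals: List[Proposal] = []
    for s in range(n):
        for e in range(s + min_len, n + 1):
            inside = prefix[e] - prefix[s]
            in_mean = inside / (e - s)
            out_count = n - (e - s)
            out_mean = (total - inside) / out_count if out_count else 0.0
            proposals.append((s, e, in_mean - out_mean))
    return proposals


def top_k(proposals: List[Proposal], k: int) -> List[Proposal]:
    """Keep the k highest-scoring proposals."""
    return sorted(proposals, key=lambda p: p[2], reverse=True)[:k]


def best_ordered_pair(
    first: List[Proposal], second: List[Proposal]
) -> Optional[Tuple[Proposal, Proposal]]:
    """Pick one proposal per sub-event so that the first ends no later than
    the second starts, maximizing the summed score. Returns None when no
    pair satisfies the assumed ordering constraint."""
    best: Optional[Tuple[Proposal, Proposal]] = None
    for a, b in product(first, second):
        if a[1] <= b[0]:
            if best is None or a[2] + b[2] > best[0][2] + best[1][2]:
                best = (a, b)
    return best


# Toy similarities for two sub-event descriptions over a six-frame video.
sims_event_a = [0.1, 0.9, 0.8, 0.1, 0.1, 0.1]
sims_event_b = [0.1, 0.1, 0.1, 0.2, 0.9, 0.8]

candidates_a = top_k(score_proposals(sims_event_a), k=3)
candidates_b = top_k(score_proposals(sims_event_b), k=3)
pair = best_ordered_pair(candidates_a, candidates_b)
```

In this toy example the selected pair covers frames 1 to 2 for the first sub-event and frames 4 to 5 for the second. Swapping the two candidate lists yields no valid pair, which shows how an ordering constraint can discard otherwise high-scoring proposals.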


Pages: 20-37

Page count: 18

