Utilizing Text-Video Relationships: A Text-Driven Multi-modal Fusion Framework for Moment Retrieval and Highlight Detection

Cited by: 0
Authors
Zhou, Siyu [1 ]
Zhang, Fjwei [2 ]
Wang, Ruomei [3 ]
Su, Zhuo [1 ]
Affiliations
[1] Sun Yat Sen Univ, Natl Engn Res Ctr Digital Life, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] North Univ China, Sch Comp Sci & Technol, Taiyuan, Peoples R China
[3] Sun Yat Sen Univ, Sch Software Engn, Guangzhou, Peoples R China
Source
PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT X | 2025 / Vol. 15040
Keywords
Multimodality; Moment retrieval; Highlight detection;
DOI
10.1007/978-981-97-8792-0_18
CLC number
TP18 [Artificial intelligence theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video moment retrieval and highlight detection are both text-related tasks in video understanding. Recent works primarily focus on enhancing the interaction between overall video features and the query text. However, they overlook the relationships between distinct video modalities and the query text, fusing multi-modal video features in a query-agnostic manner. The overall video features obtained through this fusion method may lose information relevant to the query text, making it difficult to predict results accurately in subsequent reasoning. To address this issue, we introduce a Text-driven Integration Framework (TdIF) that fully leverages the relationships between video modalities and the query text to obtain an enriched video representation. It fuses multi-modal video features under the guidance of the query text, effectively emphasizing query-related video information. In TdIF, we also design a query-adaptive token to enhance the interaction between the video and the query text. Furthermore, to enrich the semantic information of the video representation, we introduce and leverage descriptive text of the video in a simple and efficient manner. Extensive experiments on the QVHighlights, Charades-STA, TACoS and TVSum datasets validate the superiority of TdIF.
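The query-guided fusion the abstract describes can be sketched as scaled dot-product attention over per-modality features, with the text embedding as the query. This is a minimal, hypothetical NumPy illustration of the general idea (function name, dimensions, and the simple softmax weighting are assumptions, not the authors' implementation):

```python
import numpy as np

def text_guided_fuse(modal_feats: np.ndarray, text_feat: np.ndarray) -> np.ndarray:
    """Fuse per-modality video features, weighting each modality by its
    scaled dot-product similarity to the query-text embedding.

    modal_feats: (num_modalities, d) array, e.g. visual/motion/audio features.
    text_feat:   (d,) query-text embedding.
    Returns a (d,) fused video feature emphasizing query-related modalities.
    """
    d = text_feat.shape[-1]
    scores = modal_feats @ text_feat / np.sqrt(d)   # (num_modalities,)
    weights = np.exp(scores - scores.max())         # numerically stable softmax
    weights /= weights.sum()
    return weights @ modal_feats                    # convex combination of modalities

# Toy example with 3 modalities and feature dimension 8.
rng = np.random.default_rng(0)
modal_feats = rng.standard_normal((3, 8))
text_feat = rng.standard_normal(8)
fused = text_guided_fuse(modal_feats, text_feat)
```

In contrast, a query-agnostic baseline would average the modalities with fixed weights, which is where the abstract argues query-relevant information can be lost.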
Pages: 254-268
Number of pages: 15