Utilizing Text-Video Relationships: A Text-Driven Multi-modal Fusion Framework for Moment Retrieval and Highlight Detection
Cited by: 0
Authors:
Zhou, Siyu [1]
Zhang, Fjwei [2]
Wang, Ruomei [3]
Su, Zhuo [1]
Affiliations:
[1] Sun Yat Sen Univ, Natl Engn Res Ctr Digital Life, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] North Univ China, Sch Comp Sci & Technol, Taiyuan, Peoples R China
[3] Sun Yat Sen Univ, Sch Software Engn, Guangzhou, Peoples R China
Source:
PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT X | 2025, Vol. 15040
Keywords:
Multimodality; Moment retrieval; Highlight detection
DOI:
10.1007/978-981-97-8792-0_18
Chinese Library Classification (CLC): TP18 [Theory of Artificial Intelligence]
Discipline codes: 081104; 0812; 0835; 1405
Abstract:
Video moment retrieval and highlight detection are both text-related tasks in video understanding. Recent works primarily focus on enhancing the interaction between overall video features and the query text. However, they overlook the relationships between distinct video modalities and the query text, and fuse multi-modal video features in a query-agnostic manner. The overall video features obtained through this fusion may lose information relevant to the query text, making it difficult to predict results accurately in subsequent reasoning. To address this issue, we introduce a Text-driven Integration Framework (TdIF) that fully leverages the relationships between video modalities and the query text to obtain an enriched video representation. It fuses multi-modal video features under the guidance of the query text, effectively emphasizing query-related video information. In TdIF, we also design a query-adaptive token to enhance the interaction between the video and the query text. Furthermore, to enrich the semantic information of the video representation, we introduce and leverage descriptive text of the video in a simple and efficient manner. Extensive experiments on the QVHighlights, Charades-STA, TACoS and TVSum datasets validate the superiority of TdIF.
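The core idea of query-guided fusion can be illustrated with a minimal sketch: each video modality's per-clip features are weighted by their similarity to the query-text embedding before being merged, so query-relevant modalities dominate the fused representation. This is an illustrative simplification, not the authors' TdIF architecture; all function and variable names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def text_driven_fusion(modality_feats, query_feat):
    """Fuse per-clip features from several video modalities, weighting each
    modality by its similarity to a pooled query-text embedding.

    modality_feats: list of (T, D) arrays, one per modality (e.g. RGB, audio).
    query_feat:     (D,) pooled query-text embedding.
    Returns a (T, D) fused video representation.
    """
    # Score each modality per clip by dot product with the query text.
    scores = np.stack([f @ query_feat for f in modality_feats], axis=-1)  # (T, M)
    weights = softmax(scores, axis=-1)                                    # (T, M)
    stacked = np.stack(modality_feats, axis=-1)                           # (T, D, M)
    # Convex combination of modalities per clip, driven by the query.
    return (stacked * weights[:, None, :]).sum(axis=-1)                   # (T, D)

# Toy usage: 4 clips, 8-dim features, two modalities.
rng = np.random.default_rng(0)
rgb, audio = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
query = rng.normal(size=8)
fused = text_driven_fusion([rgb, audio], query)
print(fused.shape)  # (4, 8)
```

In contrast, a query-agnostic baseline (e.g. simple averaging of modalities) would keep the weights fixed at 1/M regardless of the query, which is exactly the failure mode the abstract describes.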