BiC-Net: Learning Efficient Spatio-temporal Relation for Text-Video Retrieval

Cited by: 4
Authors
Han, Ning [1 ]
Zeng, Yawen [2 ]
Shi, Chuhao [3 ]
Xiao, Guangyi [3 ]
Chen, Hao [3 ]
Chen, Jingjing [4 ]
Affiliations
[1] Xiangtan Univ, Sch Comp Sci, Xiangtan 411105, Peoples R China
[2] Bytedance AI Lab, 43 North Third Ring West Rd, Beijing 100098, Peoples R China
[3] Hunan Univ, Coll Comp Sci & Elect Engn, 116 Lu Shan South Rd, Changsha 410082, Peoples R China
[4] Fudan Univ, Sch Comp Sci, 20 Handan Rd, Shanghai 200433, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Text-video retrieval; spatio-temporal relation; bi-branch complementary network; IMAGE;
DOI
10.1145/3627103
Chinese Library Classification (CLC)
TP [Automation Technology; Computer Technology];
Discipline Classification Code
0812;
Abstract
The task of text-video retrieval aims to understand the correspondence between language and vision and has gained increasing attention in recent years. Recent works have demonstrated the superiority of local spatio-temporal relation learning with graph-based models. However, most existing graph-based models are handcrafted and depend heavily on expert knowledge and empirical feedback, which may prevent them from effectively mining high-level, fine-grained visual relations. These limitations leave them unable to distinguish videos that share the same visual components but differ in their relations. To solve this problem, we propose a novel cross-modal retrieval framework, Bi-Branch Complementary Network (BiC-Net), which modifies the Transformer architecture to effectively bridge the text and video modalities in a complementary manner by combining local spatio-temporal relations and global temporal information. Specifically, local video representations are encoded using multiple Transformer blocks and additional residual blocks to learn fine-grained spatio-temporal relations and long-term temporal dependencies; we call this module the Fine-grained Spatio-temporal Transformer (FST). Global video representations are encoded using a multi-layer Transformer block to learn global temporal features. Finally, we align the spatio-temporal relation features and the global temporal features with the text feature in two embedding spaces for cross-modal text-video retrieval. Extensive experiments are conducted on the MSR-VTT, MSVD, and YouCook2 datasets. The results demonstrate the effectiveness of our proposed model. Our code is publicly available at https://github.com/lionel-hing/BiC-Net.
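Based only on the architecture described in the abstract, the following is a minimal PyTorch-style sketch of a bi-branch text-video matching model: a local branch of Transformer blocks plus a residual block over region-level tokens, a global branch of Transformer blocks over frame-level tokens, and separate projections of video and text features into two joint embedding spaces whose similarities are combined. All names (ResidualBlock, BiBranchVideoTextModel, region_feats, frame_feats, text_feat), layer counts, the mean pooling, and the summed cosine similarities are illustrative assumptions, not the authors' released implementation (see the linked GitHub repository for that).

# Illustrative sketch only; sizes and pooling choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Simple feed-forward residual block (assumed design, not from the paper)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(x + self.net(x))

class BiBranchVideoTextModel(nn.Module):
    def __init__(self, dim=512, heads=8, local_layers=2, global_layers=4, embed_dim=256):
        super().__init__()
        # Local branch: Transformer blocks + a residual block over region tokens.
        local_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.local_transformer = nn.TransformerEncoder(local_layer, num_layers=local_layers)
        self.local_residual = ResidualBlock(dim)
        # Global branch: multi-layer Transformer over frame tokens.
        global_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.global_transformer = nn.TransformerEncoder(global_layer, num_layers=global_layers)
        # Separate projections into the two joint embedding spaces.
        self.local_video_proj = nn.Linear(dim, embed_dim)
        self.global_video_proj = nn.Linear(dim, embed_dim)
        self.local_text_proj = nn.Linear(dim, embed_dim)
        self.global_text_proj = nn.Linear(dim, embed_dim)

    def forward(self, region_feats, frame_feats, text_feat):
        # region_feats: (B, frames*regions, dim) flattened spatio-temporal tokens
        # frame_feats:  (B, frames, dim) frame-level tokens
        # text_feat:    (B, dim) pooled sentence feature
        local = self.local_residual(self.local_transformer(region_feats)).mean(dim=1)
        global_feat = self.global_transformer(frame_feats).mean(dim=1)

        v_local = F.normalize(self.local_video_proj(local), dim=-1)
        v_global = F.normalize(self.global_video_proj(global_feat), dim=-1)
        t_local = F.normalize(self.local_text_proj(text_feat), dim=-1)
        t_global = F.normalize(self.global_text_proj(text_feat), dim=-1)

        # Combine similarities from the two embedding spaces for retrieval.
        sim = v_local @ t_local.t() + v_global @ t_global.t()
        return sim  # (B, B) pairwise text-video similarity matrix

In practice, a two-space model of this kind would typically be trained with a contrastive or triplet ranking loss over the pairwise similarity matrix; the abstract does not state the specific loss used.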
Pages: 21