BiC-Net: Learning Efficient Spatio-temporal Relation for Text-Video Retrieval

Cited by: 4
Authors
Han, Ning [1 ]
Zeng, Yawen [2 ]
Shi, Chuhao [3 ]
Xiao, Guangyi [3 ]
Chen, Hao [3 ]
Chen, Jingjing [4 ]
Affiliations
[1] Xiangtan Univ, Sch Comp Sci, Xiangtan 411105, Peoples R China
[2] Bytedance AI Lab, 43 North Third Ring West Rd, Beijing 100098, Peoples R China
[3] Hunan Univ, Coll Comp Sci & Elect Engn, 116 Lu Shan South Rd, Changsha 410082, Peoples R China
[4] Fudan Univ, Sch Comp Sci, 20 Handan Rd, Shanghai 200433, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Text-video retrieval; spatio-temporal relation; bi-branch complementary network; IMAGE;
DOI
10.1145/3627103
Chinese Library Classification (CLC)
TP [Automation Technology; Computer Technology];
Discipline Classification Code
0812;
Abstract
The task of text-video retrieval aims to understand the correspondence between language and vision and has gained increasing attention in recent years. Recent works have demonstrated the superiority of local spatio-temporal relation learning with graph-based models. However, most existing graph-based models are handcrafted and depend heavily on expert knowledge and empirical feedback, which may prevent them from effectively mining high-level, fine-grained visual relations. These limitations leave them unable to distinguish videos that share the same visual components but differ in their relations. To solve this problem, we propose a novel cross-modal retrieval framework, Bi-Branch Complementary Network (BiC-Net), which modifies the Transformer architecture to effectively bridge the text and video modalities in a complementary manner by combining local spatio-temporal relations and global temporal information. Specifically, local video representations are encoded using multiple Transformer blocks and additional residual blocks to learn fine-grained spatio-temporal relations and long-term temporal dependencies; we call this module the Fine-grained Spatio-temporal Transformer (FST). Global video representations are encoded using a multi-layer Transformer block to learn global temporal features. Finally, we align the spatio-temporal relation features and the global temporal features with the text feature in two embedding spaces for cross-modal text-video retrieval. Extensive experiments are conducted on the MSR-VTT, MSVD, and YouCook2 datasets. The results demonstrate the effectiveness of our proposed model. Our code is publicly available at https://github.com/lionel-hing/BiC-Net.
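Based only on the architecture described in the abstract, the following is a minimal PyTorch-style sketch of a bi-branch text-video matching model: a local branch of Transformer blocks plus a residual block over region-level tokens, a global branch of Transformer blocks over frame-level tokens, and separate projections of video and text features into two joint embedding spaces whose similarities are combined. All names (ResidualBlock, BiBranchVideoTextModel, region_feats, frame_feats, text_feat), layer counts, the mean pooling, and the summed cosine similarities are illustrative assumptions, not the authors' released implementation (see the linked GitHub repository for that).

# Illustrative sketch only; sizes and pooling choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Simple feed-forward residual block (assumed design, not from the paper)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(x + self.net(x))

class BiBranchVideoTextModel(nn.Module):
    def __init__(self, dim=512, heads=8, local_layers=2, global_layers=4, embed_dim=256):
        super().__init__()
        # Local branch: Transformer blocks + a residual block over region tokens.
        local_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.local_transformer = nn.TransformerEncoder(local_layer, num_layers=local_layers)
        self.local_residual = ResidualBlock(dim)
        # Global branch: multi-layer Transformer over frame tokens.
        global_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.global_transformer = nn.TransformerEncoder(global_layer, num_layers=global_layers)
        # Separate projections into the two joint embedding spaces.
        self.local_video_proj = nn.Linear(dim, embed_dim)
        self.global_video_proj = nn.Linear(dim, embed_dim)
        self.local_text_proj = nn.Linear(dim, embed_dim)
        self.global_text_proj = nn.Linear(dim, embed_dim)

    def forward(self, region_feats, frame_feats, text_feat):
        # region_feats: (B, frames*regions, dim) flattened spatio-temporal tokens
        # frame_feats:  (B, frames, dim) frame-level tokens
        # text_feat:    (B, dim) pooled sentence feature
        local = self.local_residual(self.local_transformer(region_feats)).mean(dim=1)
        global_feat = self.global_transformer(frame_feats).mean(dim=1)

        v_local = F.normalize(self.local_video_proj(local), dim=-1)
        v_global = F.normalize(self.global_video_proj(global_feat), dim=-1)
        t_local = F.normalize(self.local_text_proj(text_feat), dim=-1)
        t_global = F.normalize(self.global_text_proj(text_feat), dim=-1)

        # Combine similarities from the two embedding spaces for retrieval.
        sim = v_local @ t_local.t() + v_global @ t_global.t()
        return sim  # (B, B) pairwise text-video similarity matrix

In practice, a two-space model of this kind would typically be trained with a contrastive or triplet ranking loss over the pairwise similarity matrix; the abstract does not state the specific loss used.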
Pages: 21