CONTEXT-AWARE HIERARCHICAL TRANSFORMER FOR FINE-GRAINED VIDEO-TEXT RETRIEVAL

Cited by: 0
Authors
Chen, Mingliang [1]
Zhang, Weimin [2]
Ren, Yurui [1]
Li, Ge [1]
Affiliations
[1] Peking Univ, Sch Elect & Comp Engn, Shenzhen Grad Sch, Beijing, Peoples R China
[2] AVS Ind Alliance, Beijing, Peoples R China
Source
2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP | 2022
Funding
National Key Research and Development Program of China;
Keywords
Video-Text Retrieval; Context-aware Hierarchical Transformer;
DOI
10.1109/ICIP46576.2022.9897206
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video-text retrieval aims to accurately retrieve the videos that correspond to a given text, and vice versa. Mainstream methods typically address this problem by learning a joint embedding space and then measuring the similarities between videos and texts. However, these methods lack the ability to represent detailed semantic information. We therefore first use three pre-trained models to construct video embeddings at different semantic levels, and then propose a Context-aware Hierarchical Transformer (CHT) model to encode the context information between these levels. More specifically, our model builds fine-grained hierarchical video embeddings at three semantic levels: global, objects, and actions. Attention-based contextual transformers establish the context interactions between the semantic levels. Experimental results on two benchmark video-text retrieval datasets demonstrate the superiority of our CHT model, and ablation studies confirm the effectiveness of the proposed model.
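The joint-embedding retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the CHT model: it assumes the embeddings for the three levels (global, objects, actions) have already been computed, assumes the text side also has one embedding per level, and simply averages cosine similarities across levels. The function names (`cosine`, `video_text_score`, `rank_videos`) and the toy vectors are hypothetical.

```python
import math

# Assumed level names, taken from the three semantic levels named in the abstract.
LEVELS = ("global", "objects", "actions")


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def video_text_score(text_levels, video_levels):
    """Average per-level cosine similarity; the averaging is an illustrative assumption."""
    return sum(cosine(text_levels[k], video_levels[k]) for k in LEVELS) / len(LEVELS)


def rank_videos(text_levels, videos):
    """Return (video_id, score) pairs sorted from most to least similar."""
    scored = [(vid, video_text_score(text_levels, emb)) for vid, emb in videos.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings, invented purely for illustration.
query = {"global": [1.0, 0.0], "objects": [0.0, 1.0], "actions": [1.0, 1.0]}
videos = {
    "video_a": {"global": [1.0, 0.0], "objects": [0.0, 1.0], "actions": [1.0, 1.0]},
    "video_b": {"global": [0.0, 1.0], "objects": [1.0, 0.0], "actions": [1.0, -1.0]},
}
ranking = rank_videos(query, videos)
```

In this toy setup `video_a` matches the query at every level and ranks first; text-to-video and video-to-text retrieval would use the same score with the roles of query and candidates swapped.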
Pages: 386-390
Page count: 5