CONTEXT-AWARE HIERARCHICAL TRANSFORMER FOR FINE-GRAINED VIDEO-TEXT RETRIEVAL

Cited by: 0

Authors:

Chen, Mingliang [1]

Zhang, Weimin [2]

Ren, Yurui [1]

Li, Ge [1]

Affiliations:

[1] Peking Univ, Sch Elect & Comp Engn, Shenzhen Grad Sch, Beijing, Peoples R China

[2] AVS Ind Alliance, Beijing, Peoples R China

Source:

2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP | 2022

Funding:

National Key Research and Development Program of China;

Keywords:

Video-Text Retrieval; Context-aware Hierarchical Transformer;

DOI:

10.1109/ICIP46576.2022.9897206

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video-text retrieval aims to retrieve the videos that correspond to a given text, and vice versa. Mainstream methods typically address this problem by learning a joint embedding space and then measuring the similarities between videos and texts. However, these methods lack the ability to represent detailed semantic information. To address this, we first use three pre-trained models to construct video embeddings at different semantic levels, and then propose a Context-aware Hierarchical Transformer (CHT) model to encode the context information between these levels. More specifically, our model builds fine-grained hierarchical video embeddings at three semantic levels: global, objects, and actions. Attention-based contextual transformers establish the context interactions between the semantic levels. Experimental results on two benchmark video-text retrieval datasets demonstrate the superiority of our CHT model. Ablation studies also confirm the effectiveness of the proposed model.
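The retrieval step that the abstract attributes to mainstream methods (embed videos and texts in a shared space, then rank by similarity) can be illustrated with a minimal sketch. This is not the CHT model: the vectors are hand-written toy values rather than learned embeddings, cosine similarity is an assumed choice (the abstract does not name the similarity measure), and the function names `cosine_similarity` and `rank_by_similarity` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar to the query.

    Works in both directions: a text embedding against video embeddings,
    or a video embedding against text embeddings (assumed to share one space).
    """
    scored = [(cosine_similarity(query, c), i) for i, c in enumerate(candidates)]
    # Sort by descending score; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored]


if __name__ == "__main__":
    # Toy stand-ins for embeddings in a shared 3-dimensional space.
    video_embeddings = [
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.7, 0.7, 0.0],
    ]
    text_embedding = [0.9, 0.1, 0.0]
    print(rank_by_similarity(text_embedding, video_embeddings))  # [0, 2, 1]
```

In this toy setting the text query is closest to video 0, then video 2, then video 1. In a trained system the embeddings would come from learned video and text encoders rather than fixed vectors.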

Pages: 386-390

Page count: 5

