UATVR: Uncertainty-Adaptive Text-Video Retrieval

Cited by: 7
Authors
Fang, Bo [1 ]
Wu, Wenhao [2 ,3 ]
Liu, Chang [4 ]
Zhou, Yu [1 ]
Song, Yuxin [3 ]
Wang, Weiping [1 ]
Shu, Xiangbo [5 ]
Ji, Xiangyang [4 ]
Wang, Jingdong [3 ]
Affiliations
[1] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
[2] Univ Sydney, Sydney, NSW, Australia
[3] Baidu Inc, Beijing, Peoples R China
[4] Tsinghua Univ, Beijing, Peoples R China
[5] Nanjing Univ Sci & Technol, Nanjing, Peoples R China
Source
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023) | 2023
DOI: 10.1109/ICCV51070.2023.01262
CLC Classification: TP18 [Theory of Artificial Intelligence]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
With the explosive growth of web videos and emerging large-scale vision-language pre-training models, e.g., CLIP, retrieving videos of interest with text instructions has attracted increasing attention. A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities in specific granularities for semantic correspondence. Unfortunately, the intrinsic uncertainties of optimal entity combinations in appropriate granularities for cross-modal queries are understudied, which is especially critical for modalities with hierarchical semantics, e.g., video, text, etc. In this paper, we propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each lookup as a distribution matching procedure. Concretely, we add additional learnable tokens in the encoders to adaptively aggregate multi-grained semantics for flexible high-level reasoning. In the refined embedding space, we represent text-video pairs as probabilistic distributions where prototypes are sampled for matching evaluation. Comprehensive experiments on four benchmarks justify the superiority of our UATVR, which achieves new state-of-the-art results on MSR-VTT (50.8%), VATEX (64.5%), MSVD (49.7%), and DiDeMo (45.8%). The code is available at https://github.com/bofang98/UATVR.
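The abstract's core idea of representing each text or video as a probabilistic distribution, then sampling prototypes from both sides to score a match, can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: the Gaussian parameterization, the reparameterized sampling, and the averaged-cosine scoring (function names `sample_prototypes` and `expected_similarity`) are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prototypes(mean, log_var, k, rng):
    """Draw k prototype vectors from a diagonal Gaussian via reparameterization.

    mean, log_var: (d,) arrays parameterizing one modality's embedding distribution.
    """
    std = np.exp(0.5 * log_var)
    eps = rng.standard_normal((k, mean.shape[-1]))
    return mean + eps * std

def expected_similarity(text_mean, text_log_var,
                        video_mean, video_log_var, k=8, rng=rng):
    """Score a text-video pair as the mean cosine similarity over k x k
    sampled prototype pairs (one simple choice of matching evaluation)."""
    t = sample_prototypes(text_mean, text_log_var, k, rng)
    v = sample_prototypes(video_mean, video_log_var, k, rng)
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    v = v / np.linalg.norm(v, axis=-1, keepdims=True)
    return float((t @ v.T).mean())
```

A larger predicted variance spreads the prototypes out, so uncertain queries are scored against a wider region of the embedding space rather than a single point.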
Pages: 13677-13687 (11 pages)