UATVR: Uncertainty-Adaptive Text-Video Retrieval

Cited by: 7
Authors
Fang, Bo [1 ]
Wu, Wenhao [2 ,3 ]
Liu, Chang [4 ]
Zhou, Yu [1 ]
Song, Yuxin [3 ]
Wang, Weiping [1 ]
Shu, Xiangbo [5 ]
Ji, Xiangyang [4 ]
Wang, Jingdong [3 ]
Affiliations
[1] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
[2] Univ Sydney, Sydney, NSW, Australia
[3] Baidu Inc, Beijing, Peoples R China
[4] Tsinghua Univ, Beijing, Peoples R China
[5] Nanjing Univ Sci & Technol, Nanjing, Peoples R China
Source
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023) | 2023
DOI: 10.1109/ICCV51070.2023.01262
CLC Classification: TP18 [Theory of Artificial Intelligence]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
With the explosive growth of web videos and emerging large-scale vision-language pre-training models, e.g., CLIP, retrieving videos of interest with text instructions has attracted increasing attention. A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities in specific granularities for semantic correspondence. Unfortunately, the intrinsic uncertainties of optimal entity combinations in appropriate granularities for cross-modal queries are understudied, which is especially critical for modalities with hierarchical semantics, e.g., video, text, etc. In this paper, we propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each lookup as a distribution matching procedure. Concretely, we add additional learnable tokens in the encoders to adaptively aggregate multi-grained semantics for flexible high-level reasoning. In the refined embedding space, we represent text-video pairs as probabilistic distributions where prototypes are sampled for matching evaluation. Comprehensive experiments on four benchmarks justify the superiority of our UATVR, which achieves new state-of-the-art results on MSR-VTT (50.8%), VATEX (64.5%), MSVD (49.7%), and DiDeMo (45.8%). The code is available at https://github.com/bofang98/UATVR.
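The abstract's core idea of representing each text or video as a probabilistic distribution, then sampling prototypes from both sides to score a match, can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: the Gaussian parameterization, the reparameterized sampling, and the averaged-cosine scoring (function names `sample_prototypes` and `expected_similarity`) are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prototypes(mean, log_var, k, rng):
    """Draw k prototype vectors from a diagonal Gaussian via reparameterization.

    mean, log_var: (d,) arrays parameterizing one modality's embedding distribution.
    """
    std = np.exp(0.5 * log_var)
    eps = rng.standard_normal((k, mean.shape[-1]))
    return mean + eps * std

def expected_similarity(text_mean, text_log_var,
                        video_mean, video_log_var, k=8, rng=rng):
    """Score a text-video pair as the mean cosine similarity over k x k
    sampled prototype pairs (one simple choice of matching evaluation)."""
    t = sample_prototypes(text_mean, text_log_var, k, rng)
    v = sample_prototypes(video_mean, video_log_var, k, rng)
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    v = v / np.linalg.norm(v, axis=-1, keepdims=True)
    return float((t @ v.T).mean())
```

A larger predicted variance spreads the prototypes out, so uncertain queries are scored against a wider region of the embedding space rather than a single point.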
Pages: 13677-13687 (11 pages)