Robust Video-Text Retrieval Via Noisy Pair Calibration

Cited by: 9
Authors
Zhang, Huaiwen [1 ,2 ,3 ]
Yang, Yang [1 ,2 ,3 ]
Qi, Fan [4 ]
Qian, Shengsheng [5 ,6 ]
Xu, Changsheng [5 ,6 ]
Affiliations
[1] Inner Mongolia Univ, Coll Comp Sci, Hohhot 010031, Peoples R China
[2] Natl & Local Joint Engn Res Ctr Intelligent Inform, Hohhot 010031, Peoples R China
[3] Inner Mongolia Key Lab Mongolian Informat Proc Tec, Hohhot 010021, Peoples R China
[4] Tianjin Univ Technol, Sch Comp Sci & Engn, Tianjin 300384, Peoples R China
[5] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
[6] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Noise calibration; uncertainty; video-text retrieval;
DOI
10.1109/TMM.2023.3239183
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Video-text retrieval is a fundamental task for managing the massive amounts of emerging video data. The main challenge is to learn a common representation space for videos and queries in which similarity measurements reflect semantic closeness. However, existing video-text retrieval models may suffer from two kinds of noise during common-space learning. First, the video-text correspondences in positive pairs may not be exact matches: existing datasets are annotated by crowdsourced, non-expert annotators, which leads to inevitable tagging noise. Second, video-text representations are learned against randomly sampled negatives, so instances that are semantically similar to the query may be incorrectly treated as negative samples. To alleviate the adverse impact of these noisy pairs, we propose a novel robust video-text retrieval method that protects the model from noisy positive and negative pairs by identifying and calibrating them with an uncertainty score. In particular, we propose a noisy-pair identifier that divides the training set into noisy and clean subsets based on the estimated uncertainty of each pair. Then, guided by these uncertainties, we calibrate the two types of noisy pairs with an adaptive-margin triplet loss and a weighted triplet loss, respectively. To verify the effectiveness of our method, we conduct extensive experiments on three widely used datasets. The results show that the proposed method successfully identifies and calibrates noisy pairs and improves retrieval performance.
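As a rough illustration of the calibration idea described in the abstract, the following is a minimal PyTorch sketch of an uncertainty-based noisy/clean split together with the two loss variants. The uncertainty scores, the threshold, the margin schedule, and the weighting scheme are hypothetical stand-ins for illustration only; the paper's exact formulations (see the DOI above) may differ.

```python
import torch
import torch.nn.functional as F

def split_by_uncertainty(uncertainty: torch.Tensor, threshold: float = 0.5):
    """Hypothetical noisy-pair identifier: pairs whose estimated
    uncertainty exceeds a threshold are routed to the noisy subset."""
    noisy = uncertainty > threshold
    return ~noisy, noisy  # boolean masks for clean / noisy pairs

def adaptive_margin_triplet_loss(anchor, positive, negative,
                                 uncertainty, base_margin=0.2):
    """Triplet loss whose margin shrinks for positive pairs that are
    likely mislabeled (illustrative schedule: margin * (1 - uncertainty))."""
    margin = base_margin * (1.0 - uncertainty)
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)  # distance to positive
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)  # distance to negative
    return F.relu(d_pos - d_neg + margin).mean()

def weighted_triplet_loss(anchor, positive, negative,
                          neg_uncertainty, margin=0.2):
    """Triplet loss that down-weights triplets whose sampled 'negative'
    is likely a false negative (semantically close to the query)."""
    weight = 1.0 - neg_uncertainty  # trust triplets with clean negatives more
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)
    return (weight * F.relu(d_pos - d_neg + margin)).mean()

# Toy usage: 8 video-text pairs embedded in a 256-d common space.
v = F.normalize(torch.randn(8, 256), dim=-1)  # video embeddings (anchors)
t = F.normalize(torch.randn(8, 256), dim=-1)  # paired text embeddings
n = F.normalize(torch.randn(8, 256), dim=-1)  # randomly sampled negatives
u = torch.rand(8)                             # estimated per-pair uncertainty
loss = adaptive_margin_triplet_loss(v, t, n, u) + weighted_triplet_loss(v, t, n, u)
```

The intuition behind both variants is the same: shrinking the margin for suspect positives keeps likely mismatches from dominating the gradient, while down-weighting triplets with suspect negatives avoids pushing semantically matching pairs apart.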
Pages: 8632-8645
Page count: 14
References: 49