Noisy-Correspondence Learning for Text-to-Image Person Re-identification

Cited by: 19
Authors
Qin, Yang [1 ]
Chen, Yingke [2 ]
Peng, Dezhong [1 ,5 ,6 ]
Peng, Xi [1 ]
Zhou, Joey Tianyi [3 ,4 ]
Hu, Peng [1 ]
Affiliations
[1] Sichuan Univ, Coll Comp Sci, Chengdu 610095, Peoples R China
[2] Northumbria Univ, Dept Comp & Informat Sci, Newcastle Upon Tyne NE1, England
[3] A*STAR, Ctr Frontier Res (CFAR), Singapore, Singapore
[4] A*STAR, Inst High Performance Comp (IHPC), Singapore, Singapore
[5] Sichuan Newstrong UHD Video Technol Co Ltd, Chengdu 610095, Peoples R China
[6] Chengdu Ruibei Yingte Informat Technol Co Ltd, Chengdu 610065, Peoples R China
Source
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024
Keywords
DOI
10.1109/CVPR52733.2024.02568
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text-to-image person re-identification (TIReID) is a compelling topic in the cross-modal community, which aims to retrieve the target person based on a textual query. Although numerous TIReID methods have been proposed and achieved promising performance, they implicitly assume the training image-text pairs are correctly aligned, which is not always the case in real-world scenarios. In practice, image-text pairs are inevitably under-correlated or even falsely correlated, a.k.a. noisy correspondence (NC), due to the low quality of the images and annotation errors. To address this problem, we propose a novel Robust Dual Embedding method (RDE) that can learn robust visual-semantic associations even with NC. Specifically, RDE consists of two main components: 1) A Confident Consensus Division (CCD) module that leverages the dual-grained decisions of dual embedding modules to obtain a consensus set of clean training data, which enables the model to learn correct and reliable visual-semantic associations. 2) A Triplet Alignment Loss (TAL) that relaxes the conventional triplet ranking loss over the hardest negative samples to a log-exponential upper bound over all negative ones, thus preventing model collapse under NC while still focusing on hard-negative samples for promising performance. We conduct extensive experiments on three public benchmarks, namely CUHK-PEDES, ICFG-PEDES, and RSTPReid, to evaluate the performance and robustness of our RDE. Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on all three datasets. Code is available at https://github.com/QinYang79/RDE.
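To illustrate the relaxation the abstract describes, below is a minimal sketch (not the authors' implementation; see the linked repository for that) of a TAL-style loss. It replaces the hard `max` over negatives in the conventional triplet ranking loss with a temperature-scaled log-sum-exp, which upper-bounds the hardest negative while spreading gradient over all negatives. The function name, the similarity-matrix convention (positives on the diagonal), and the `margin`/`tau` values are illustrative assumptions.

```python
import math

def triplet_alignment_loss(sim, margin=0.2, tau=0.02):
    """Sketch of a log-sum-exp relaxed triplet loss (illustrative, not RDE's exact code).

    sim: B x B list of lists; sim[i][j] is the similarity between image i and
    text j, so sim[i][i] is the matched (positive) pair.
    Since tau * log(sum_j exp(x_j / tau)) >= max_j x_j, the hinge below
    upper-bounds the hardest-negative triplet ranking loss.
    """
    B = len(sim)

    def lse(xs):
        # soft (log-sum-exp) upper bound on the hardest negative
        return tau * math.log(sum(math.exp(x / tau) for x in xs))

    total = 0.0
    for i in range(B):
        pos = sim[i][i]
        negs_i2t = [sim[i][j] for j in range(B) if j != i]  # image -> text
        negs_t2i = [sim[j][i] for j in range(B) if j != i]  # text -> image
        total += max(0.0, margin - pos + lse(negs_i2t))
        total += max(0.0, margin - pos + lse(negs_t2i))
    return total / B
```

With well-separated pairs (positives exceed every negative by more than the margin) the loss vanishes, exactly as with the hard-max version; as `tau` shrinks, the bound tightens toward the hardest-negative loss.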
Pages: 27187-27196
Page count: 10