Multi-Granularity Matching Transformer for Text-Based Person Search

Cited by: 5
Authors
Bao, Liping [1]
Wei, Longhui [2]
Zhou, Wengang [1]
Liu, Lin [1]
Xie, Lingxi [3]
Li, Houqiang [1]
Tian, Qi [3]
Affiliations
[1] Univ Sci & Technol China, Dept Elect Engn & Informat Sci, Hefei 230027, Peoples R China
[2] Univ Sci & Technol China, Dept Elect Engn & Informat Sci, Hefei 230027, Peoples R China
[3] Huawei Cloud, Shenzhen 518129, Peoples R China
Keywords
Transformers; Feature extraction; Task analysis; Pedestrians; Visualization; Search problems; Training; Text-based person search; transformer; vision-language pre-trained model; REIDENTIFICATION; ALIGNMENT;
DOI
10.1109/TMM.2023.3321504
Chinese Library Classification (CLC)
TP [Automation & Computer Technology]
Discipline Code
0812
Abstract
Text-based person search aims to retrieve the pedestrian images most relevant to a given textual description from an image gallery. Most existing methods rely on two separate encoders to extract image and text features, and then design elaborate schemes to bridge the gap between the image and text modalities. However, the shallow interaction between the two modalities in these methods is insufficient to eliminate the modality gap. To address this problem, we propose TransTPS, a transformer-based framework that enables deeper interaction between the two modalities through the self-attention mechanism in transformers, effectively alleviating the modality gap. In addition, because the image modality exhibits small inter-class variance and large intra-class variance, we develop two techniques to overcome these limitations. Specifically, Cross-modal Multi-Granularity Matching (CMGM) addresses the problem caused by small inter-class variance and facilitates distinguishing pedestrians with similar appearance. Contrastive Loss with Weakly Positive pairs (CLWP) mitigates the impact of large intra-class variance and helps retrieve more of the target images. Experiments on the CUHK-PEDES and RSTPReid datasets demonstrate that the proposed framework achieves state-of-the-art performance compared with previous methods.
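The abstract does not give the formal definition of CLWP, so the PyTorch sketch below only illustrates the general idea as described: gallery images that share an identity with a caption but are not its annotated match are treated as weak positives with a reduced (rather than zero) target weight in an InfoNCE-style contrastive loss. The function name clwp_loss, the weak_weight parameter, and the soft-label weighting scheme are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def clwp_loss(img_feats, txt_feats, person_ids, temperature=0.07, weak_weight=0.5):
    # Normalize features so dot products are cosine similarities.
    img_feats = F.normalize(img_feats, dim=-1)
    txt_feats = F.normalize(txt_feats, dim=-1)
    logits = txt_feats @ img_feats.t() / temperature  # (B, B) text-to-image scores

    # Strong positives: the annotated image of each caption (the diagonal).
    strong = torch.eye(len(person_ids), device=logits.device)
    # Weak positives: other same-identity images in the batch (assumption:
    # CLWP keeps these as down-weighted positives instead of negatives).
    same_id = (person_ids.unsqueeze(0) == person_ids.unsqueeze(1)).float()
    weak = same_id.to(logits.device) - strong

    # Soft target distribution over gallery images for each caption.
    targets = strong + weak_weight * weak
    targets = targets / targets.sum(dim=1, keepdim=True)
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

In practice such a loss would presumably be applied in both retrieval directions (text-to-image and image-to-text) and combined with the framework's other matching objectives; only the text-to-image direction is sketched here.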
Pages: 4281-4293
Number of pages: 13