Text-based person search by non-saliency enhancing and dynamic label smoothing

Cited by: 0
Authors
Pang Y. [1]
Zhang C. [1,2]
Li Z. [1,2]
Wei C. [2]
Wang Z. [3]
Affiliations
[1] Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin
[2] Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin
[3] School of Computer Science and Technology, Guangxi University of Science and Technology, Liuzhou
Funding
National Natural Science Foundation of China;
Keywords
Cross-modal projection matching; Label smoothing; Person re-identification; Saliency masking;
DOI
10.1007/s00521-024-09691-1
Abstract
Current text-based person re-identification (re-ID) models tend to learn the salient features of images and text, which makes them prone to failure when identifying persons dressed very similarly, because image contents with observable but indescribable differences may share an identical textual description. To address this problem, we propose a re-ID model based on saliency masking that learns non-salient but highly discriminative features, which work together with the salient features to provide more robust pedestrian identification. To further improve performance, we propose a cross-modal projection matching loss with dynamic label smoothing (CMPM-DS) to train the model; CMPM-DS adaptively adjusts the degree of smoothing applied to the true matching distribution. Extensive ablation and comparison experiments on two popular re-ID benchmarks demonstrate the effectiveness of our model and loss function: our model achieves state-of-the-art results, improving the best existing R@1 by 0.33% on CUHK-PEDES and 4.45% on RSTPReid. © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024.
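For illustration only, the sketch below shows how a cross-modal projection matching (CMPM) loss with a dynamic label-smoothing term could be written in PyTorch. The abstract names CMPM-DS but does not give its formulation, so the function name cmpm_ds_loss, the confidence-driven smoothing rule, and all parameter names are assumptions for this sketch, not the authors' implementation.

# Minimal sketch (assumed formulation): CMPM loss with a dynamic label-smoothing term.
# The exact CMPM-DS rule is not given in the abstract; here the smoothing strength is
# (illustratively) scaled by the model's current matching confidence.
import torch
import torch.nn.functional as F

def cmpm_ds_loss(image_feats, text_feats, labels, base_smooth=0.1, eps=1e-8):
    """Cross-modal projection matching loss with an assumed dynamic label-smoothing rule.

    image_feats, text_feats: (B, D) embeddings from the image and text branches.
    labels: (B,) identity labels; pairs with equal labels are treated as matches.
    base_smooth: maximum smoothing applied to the true matching distribution.
    """
    # Binary match matrix and the "true" matching distribution q (rows sum to 1).
    match = labels.unsqueeze(1).eq(labels.unsqueeze(0)).float()          # (B, B)
    q = match / match.sum(dim=1, keepdim=True)

    # Scalar projection of image features onto normalised text features,
    # followed by a softmax to get the predicted matching distribution p.
    text_norm = F.normalize(text_feats, dim=1)
    proj = image_feats @ text_norm.t()                                    # (B, B)
    p = F.softmax(proj, dim=1)

    # Dynamic smoothing degree (assumption): smooth the target more when the
    # model is over-confident, less when it is uncertain.
    confidence = p.max(dim=1, keepdim=True).values.detach()               # (B, 1)
    smooth = base_smooth * confidence
    q_smooth = (1.0 - smooth) * q + smooth / q.size(1)

    # KL-style CMPM objective between the predicted and smoothed true distributions.
    loss = (p * (torch.log(p + eps) - torch.log(q_smooth + eps))).sum(dim=1).mean()
    return loss

# Usage example: a batch of 4 image-text pairs covering two identities.
# img = torch.randn(4, 256); txt = torch.randn(4, 256); ids = torch.tensor([0, 0, 1, 1])
# loss = cmpm_ds_loss(img, txt, ids)

Tying the smoothing strength to the model's confidence is one plausible way to "adaptively adjust the smoothing degree of the true distribution" as described above; the paper's actual rule may differ.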
Pages: 13327-13339
Number of pages: 12