Fine-grained Semantics-aware Representation Learning for Text-based Person Retrieval

Cited by: 1
Authors
Wang, Di [1 ]
Yan, Feng [1 ]
Wang, Yifeng [1 ]
Zhao, Lin [2 ]
Liang, Xiao [1 ]
Zhong, Haodi [1 ]
Zhang, Ronghua [3 ]
Affiliations
[1] Xidian Univ, Xian, Peoples R China
[2] Nanjing Univ Sci & Technol, Nanjing, Peoples R China
[3] Shihezi Univ, Shihezi, Peoples R China
Source
PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024 | 2024
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
text-based person retrieval; cross-modal retrieval; semantic alignment; self-distillation;
DOI
10.1145/3652583.3658054
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text-based person retrieval aims to search for target persons based on a given text description query. However, existing methods often suffer from the following problems: (1) they ignore local attribute information between different persons during feature learning, which results in low distinguishability of similar people's feature representations; (2) they lack fine-grained semantic alignment between visual images and text descriptions, which leads to inconsistency in person details between the query and the target. To address these issues, we propose a Fine-grained Semantics-aware Representation Learning (FSRL) method that establishes intra-modal local attribute correlations and inter-modal fine-grained semantic correlations. Specifically, we first design an identity self-distillation module that explores soft identity labels reflecting local attribute similarities among different people. These soft identity labels help the model learn discriminative features associated with fine-grained attributes of persons. Second, we propose a visual-language relationship modeling module that forces the model to proofread "error words" randomly substituted into the text during cross-modal interaction, thereby establishing fine-grained image-text semantic correlations. Extensive experiments show that the proposed method achieves new state-of-the-art results on three benchmark datasets and also performs well on the domain generalization task. Our code is available at https://github.com/y416f/FSRL.
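For intuition, the following is a minimal PyTorch sketch of the two ideas the abstract describes, not the authors' implementation (that is in the linked repository). The module names, the EMA-teacher formulation of self-distillation, and all hyperparameters (momentum, alpha, tau, replace_prob, the BERT-style special token ids, the CUHK-PEDES-sized identity count) are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class IdentitySelfDistillation(nn.Module):
    """Sketch of soft identity labels via self-distillation: an EMA (momentum)
    copy of the ID classifier produces a softened distribution over person
    identities, which can encode local-attribute similarity between different
    people; the student head is trained on a mix of the hard one-hot label and
    this soft target. All hyperparameters here are assumed, not from the paper."""

    def __init__(self, feat_dim=768, num_ids=11003, momentum=0.995, alpha=0.4, tau=4.0):
        super().__init__()
        self.student = nn.Linear(feat_dim, num_ids)
        self.teacher = nn.Linear(feat_dim, num_ids)
        self.teacher.load_state_dict(self.student.state_dict())
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.momentum, self.alpha, self.tau = momentum, alpha, tau

    @torch.no_grad()
    def _update_teacher(self):
        # Exponential moving average of the student weights.
        for ps, pt in zip(self.student.parameters(), self.teacher.parameters()):
            pt.mul_(self.momentum).add_(ps.detach(), alpha=1.0 - self.momentum)

    def forward(self, feats, id_labels):
        self._update_teacher()
        logits = self.student(feats)
        with torch.no_grad():
            soft_labels = F.softmax(self.teacher(feats) / self.tau, dim=-1)
        hard = F.cross_entropy(logits, id_labels)
        soft = F.kl_div(F.log_softmax(logits / self.tau, dim=-1),
                        soft_labels, reduction="batchmean") * self.tau ** 2
        return (1.0 - self.alpha) * hard + self.alpha * soft


def corrupt_tokens(token_ids, vocab_size, replace_prob=0.15, special_ids=(0, 101, 102)):
    """Randomly swap a fraction of word tokens for random vocabulary entries,
    returning the corrupted sequence plus a 0/1 map marking the 'error words'."""
    corrupted = token_ids.clone()
    mask = torch.rand_like(token_ids, dtype=torch.float) < replace_prob
    for sid in special_ids:                      # never corrupt pad/CLS/SEP
        mask &= token_ids != sid
    corrupted[mask] = torch.randint_like(token_ids, vocab_size)[mask]
    return corrupted, mask.long()


class ErrorWordProofreading(nn.Module):
    """Per-token binary head over the cross-modal encoder output: predicting
    which words were replaced forces the model to ground each word in the image."""

    def __init__(self, hidden=768):
        super().__init__()
        self.head = nn.Linear(hidden, 2)

    def forward(self, cross_feats, error_map):   # cross_feats: (B, L, hidden)
        return F.cross_entropy(self.head(cross_feats).flatten(0, 1),
                               error_map.flatten())

In a training loop of this shape, corrupt_tokens would be applied to the description before the cross-modal encoder, and the two losses above would be added to the usual image-text matching objectives; the actual loss weighting and architecture should be taken from the official code.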
Pages: 92-100
Number of pages: 9