Fine-grained Semantics-aware Representation Learning for Text-based Person Retrieval

Cited by: 1
Authors
Wang, Di [1 ]
Yan, Feng [1 ]
Wang, Yifeng [1 ]
Zhao, Lin [2 ]
Liang, Xiao [1 ]
Zhong, Haodi [1 ]
Zhang, Ronghua [3 ]
Affiliations
[1] Xidian Univ, Xian, Peoples R China
[2] Nanjing Univ Sci & Technol, Nanjing, Peoples R China
[3] Shihezi Univ, Shihezi, Peoples R China
Source
PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024 | 2024
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
text-based person retrieval; cross-modal retrieval; semantic alignment; self-distillation;
DOI
10.1145/3652583.3658054
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text-based person retrieval aims to search for target persons based on a given text description query. However, existing methods often suffer from the following problems: (1) they ignore local attribute information between different persons during feature learning, which results in low distinguishability of similar people's feature representations; (2) they lack fine-grained semantic alignment between visual images and text descriptions, which leads to inconsistency in person details between the query and the target. To address these issues, we propose a Fine-grained Semantics-aware Representation Learning (FSRL) method that establishes intra-modal local attribute correlations and inter-modal fine-grained semantic correlations. Specifically, we first design an identity self-distillation module that explores soft identity labels reflecting local attribute similarities among different people. These soft identity labels help the model learn discriminative features associated with fine-grained attributes of persons. Second, we propose a visual-language relationship modeling module that forces the model to proofread "error words" randomly substituted into the text during cross-modal interaction, thereby establishing fine-grained image-text semantic correlations. Extensive experiments show that the proposed method achieves new state-of-the-art results on three benchmark datasets and also performs well on the domain generalization task. Our code is available at https://github.com/y416f/FSRL.
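For intuition, the following is a minimal PyTorch sketch of the two ideas the abstract describes, not the authors' implementation (that is in the linked repository). The module names, the EMA-teacher formulation of self-distillation, and all hyperparameters (momentum, alpha, tau, replace_prob, the BERT-style special token ids, the CUHK-PEDES-sized identity count) are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class IdentitySelfDistillation(nn.Module):
    """Sketch of soft identity labels via self-distillation: an EMA (momentum)
    copy of the ID classifier produces a softened distribution over person
    identities, which can encode local-attribute similarity between different
    people; the student head is trained on a mix of the hard one-hot label and
    this soft target. All hyperparameters here are assumed, not from the paper."""

    def __init__(self, feat_dim=768, num_ids=11003, momentum=0.995, alpha=0.4, tau=4.0):
        super().__init__()
        self.student = nn.Linear(feat_dim, num_ids)
        self.teacher = nn.Linear(feat_dim, num_ids)
        self.teacher.load_state_dict(self.student.state_dict())
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.momentum, self.alpha, self.tau = momentum, alpha, tau

    @torch.no_grad()
    def _update_teacher(self):
        # Exponential moving average of the student weights.
        for ps, pt in zip(self.student.parameters(), self.teacher.parameters()):
            pt.mul_(self.momentum).add_(ps.detach(), alpha=1.0 - self.momentum)

    def forward(self, feats, id_labels):
        self._update_teacher()
        logits = self.student(feats)
        with torch.no_grad():
            soft_labels = F.softmax(self.teacher(feats) / self.tau, dim=-1)
        hard = F.cross_entropy(logits, id_labels)
        soft = F.kl_div(F.log_softmax(logits / self.tau, dim=-1),
                        soft_labels, reduction="batchmean") * self.tau ** 2
        return (1.0 - self.alpha) * hard + self.alpha * soft


def corrupt_tokens(token_ids, vocab_size, replace_prob=0.15, special_ids=(0, 101, 102)):
    """Randomly swap a fraction of word tokens for random vocabulary entries,
    returning the corrupted sequence plus a 0/1 map marking the 'error words'."""
    corrupted = token_ids.clone()
    mask = torch.rand_like(token_ids, dtype=torch.float) < replace_prob
    for sid in special_ids:                      # never corrupt pad/CLS/SEP
        mask &= token_ids != sid
    corrupted[mask] = torch.randint_like(token_ids, vocab_size)[mask]
    return corrupted, mask.long()


class ErrorWordProofreading(nn.Module):
    """Per-token binary head over the cross-modal encoder output: predicting
    which words were replaced forces the model to ground each word in the image."""

    def __init__(self, hidden=768):
        super().__init__()
        self.head = nn.Linear(hidden, 2)

    def forward(self, cross_feats, error_map):   # cross_feats: (B, L, hidden)
        return F.cross_entropy(self.head(cross_feats).flatten(0, 1),
                               error_map.flatten())

In a training loop of this shape, corrupt_tokens would be applied to the description before the cross-modal encoder, and the two losses above would be added to the usual image-text matching objectives; the actual loss weighting and architecture should be taken from the official code.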
Pages: 92-100
Number of pages: 9