Learning Semantic Polymorphic Mapping for Text-Based Person Retrieval

Cited by: 3
Authors
Li, Jiayi [1 ]
Jiang, Min [1 ]
Kong, Jun [2 ]
Tao, Xuefeng [2 ]
Luo, Xi [1 ]
Affiliations
[1] Jiangnan Univ, Engn Res Ctr Intelligent Technol Healthcare, Minist Educ, Wuxi 214122, Peoples R China
[2] Jiangnan Univ, Key Lab Adv Proc Control Light Ind, Minist Educ, Wuxi 214122, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation;
Keywords
Semantics; Image reconstruction; Feature extraction; Task analysis; Cognition; Measurement; Representation learning; Text-based person retrieval; semantic polymorphism; implicit reasoning; modality alignment;
DOI
10.1109/TMM.2024.3410129
CLC Classification
TP [Automation Technology; Computer Technology];
Discipline Code
0812 ;
Abstract
Text-Based Person Retrieval (TBPR) aims to identify a particular individual within an extensive image gallery using text as the query. The principal challenge in TBPR is how to map cross-modal information into a shared latent space and learn a generic representation. Previous methods have primarily focused on aligning singular text-image pairs, disregarding the inherent polymorphism of both images and natural language expressions describing the same individual. They have also ignored the impact of the semantic-polymorphism-driven intra-modal data distribution on cross-modal matching. Recent methods employ cross-modal implicit information reconstruction to strengthen inter-modal connections; however, the reconstruction process remains ambiguous. To address these issues, we propose the Learning Semantic Polymorphic Mapping (LSPM) framework, built on pre-trained cross-modal models. First, to learn more robust cross-modal information representations, we design the Inter-modal Information Aggregation (Inter-IA) module to achieve cross-modal polymorphic mapping, fortifying the foundation of our representations. Second, to obtain a more concentrated intra-modal representation under semantic polymorphism, we design the Intra-modal Information Aggregation (Intra-IA) module to further constrain the embeddings. Third, to further exploit cross-modal interactions within the model, we design an implicit reasoning module, Masked Information Guided Reconstruction (MIGR), with constraint guidance to elevate overall performance. Extensive experiments on the CUHK-PEDES and ICFG-PEDES datasets show that LSPM achieves state-of-the-art results on Rank-1, mAP and mINP compared to existing methods.
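The retrieval setup the abstract describes — text and image features mapped into a common space and ranked by similarity — can be sketched as a toy example. The random projections below merely stand in for the pre-trained encoders, and all names are illustrative, not the paper's actual implementation:

```python
import numpy as np

# Toy sketch of text-based person retrieval in a shared embedding
# space. Random vectors stand in for encoder outputs; this is not
# the LSPM architecture, only the scoring/ranking step it relies on.

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    """Project features onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Pretend common-space features: 5 gallery images, 256 dimensions.
gallery = l2_normalize(rng.standard_normal((5, 256)))

# A text query whose embedding lies near gallery image 2
# (simulating a well-aligned cross-modal pair).
query = l2_normalize(gallery[2] + 0.05 * rng.standard_normal(256))

# Cosine similarity between the query and each gallery image,
# then rank the gallery from most to least similar.
scores = gallery @ query
ranking = np.argsort(-scores)
print(ranking[0])  # the matching identity should rank first
```

Rank-1 accuracy, mAP and mINP — the metrics reported in the abstract — are all computed from such per-query rankings over the gallery.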
Pages: 10678-10691
Page count: 14