Cross-modal alignment with synthetic caption for text-based person search

Cited by: 0

Authors:

Weichen Zhao [1]

Yuxing Lu [3]

Zhiyuan Liu [2]

Yuan Yang [3]

Ge Jiao [1]

Affiliations:

[1] Hengyang Normal University, College of Computer Science and Technology

[2] Peking University, College of Future Technology

[3] Soochow University, School of Computer Science and Technology

Source:

International Journal of Multimedia Information Retrieval | 2025, Vol. 14, Issue 2

Keywords:

Text-based person search; Cross-modal retrieval; Cross-modal alignment; Synthetic caption

DOI:

10.1007/s13735-025-00356-w

Abstract:

Text-based person search aims to retrieve a target person from a large gallery based on a natural language description. Existing methods treat it as either a one-to-one embedding or a many-to-many embedding matching problem. The former relies on the assumption that text and images are strongly aligned, while the latter inevitably leads to intra-class variation. Rather than being confined to these two approaches, we propose a new strategy, named CASC, that achieves cross-modal alignment with synthetic captions for joint image-text-caption optimization. The core of this strategy lies in generating fine-grained captions that are informative for multimodal alignment. To realize this, we introduce two novel components: the Granularity Awareness Sensor (GAS) and Conditional Contrastive Learning (CCL). GAS selects relevant features through an innovative adaptive masking strategy, giving the model an enhanced perception of discriminative features. CCL aligns different modalities by placing further constraints on the synthetic captions, comparing the similarity of hard negative samples to guard against disruption from noisy content. With the extra caption supervision, the model can learn a more comprehensive feature representation, which in turn boosts retrieval performance during inference. Experiments demonstrate that CASC outperforms existing state-of-the-art methods by 1.20%, 2.35% and 2.29% in Rank@1 on the CUHK-PEDES, ICFG-PEDES and RSTPReid datasets, respectively.
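
The Rank@1 figures quoted in the abstract refer to a standard retrieval metric: the fraction of text queries whose top-ranked gallery image shows the correct person. The sketch below illustrates how Rank@k is commonly computed from a query-by-gallery similarity matrix. It is not the authors' evaluation code; the function name `rank_at_k`, its parameters, and the toy scores are all assumptions made for illustration.

```python
def rank_at_k(similarity, query_ids, gallery_ids, k=1):
    """Fraction of queries with a correct identity among the top-k gallery items.

    similarity  : list of rows, one per text query; each row holds the
                  similarity score of that query to every gallery image
                  (hypothetical input layout).
    query_ids   : person identity label of each text query.
    gallery_ids : person identity label of each gallery image.
    """
    hits = 0
    for scores, qid in zip(similarity, query_ids):
        # Rank gallery indices by descending similarity to this query.
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        top_ids = [gallery_ids[j] for j in order[:k]]
        # Count a hit when any of the top-k images shares the query identity.
        hits += qid in top_ids
    return hits / len(query_ids)


# Toy data (made up): 3 text queries against a gallery of 3 images.
similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
query_ids = [0, 1, 2]
gallery_ids = [0, 1, 2]

rank1 = rank_at_k(similarity, query_ids, gallery_ids, k=1)
rank2 = rank_at_k(similarity, query_ids, gallery_ids, k=2)
print(f"Rank@1 = {rank1:.3f}, Rank@2 = {rank2:.3f}")
```

Under this definition, a reported gain of 1.20% in Rank@1 would mean that roughly 1.2 more queries per hundred have the correct person ranked first.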

