Cross-modal alignment with synthetic caption for text-based person search

Authors
Weichen Zhao [1]
Yuxing Lu [3]
Zhiyuan Liu [2]
Yuan Yang [3]
Ge Jiao [1]
Affiliations
[1] Hengyang Normal University, College of Computer Science and Technology
[2] Peking University, College of Future Technology
[3] Soochow University, School of Computer Science and Technology
Keywords
Text-based person search; Cross-modal retrieval; Cross-modal alignment; Synthetic caption
DOI
10.1007/s13735-025-00356-w
Abstract
Text-based person search aims to retrieve a target person from a large gallery based on a natural-language description. Existing methods treat it as either a one-to-one or a many-to-many embedding matching problem. The former relies on the assumption that text and images are strongly aligned, while the latter inevitably introduces intra-class variation. Rather than being confined to these two approaches, we propose a new strategy, named CASC, that achieves cross-modal alignment with synthetic captions through joint image-text-caption optimization. The core of this strategy lies in generating fine-grained captions that are informative for multimodal alignment. To realize this, we introduce two novel components: the Granularity Awareness Sensor (GAS) and Conditional Contrastive Learning (CCL). GAS selects relevant features through an adaptive masking strategy, giving the model an enhanced perception of discriminative features. CCL aligns the modalities by placing further constraints on the synthetic captions, comparing the similarity of hard negative samples to guard against disruption from noisy content. With the extra caption supervision, the model can learn more comprehensive feature representations, which in turn boosts retrieval performance during inference. Experiments demonstrate that CASC outperforms existing state-of-the-art methods by 1.20%, 2.35% and 2.29% in terms of Rank@1 on the CUHK-PEDES, ICFG-PEDES and RSTPReid datasets, respectively.
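The abstract reports results in terms of Rank@1. As a rough illustration of how such a retrieval metric is commonly computed (not the authors' evaluation code), the sketch below ranks gallery images for each text query by cosine similarity and counts a hit when any of the top-k images shares the query's identity. The function names `cosine` and `rank_at_k` and the toy embeddings are hypothetical and chosen only for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_at_k(text_embs, text_ids, image_embs, image_ids, k=1):
    """Fraction of text queries whose top-k retrieved images contain the same identity."""
    hits = 0
    for query, query_id in zip(text_embs, text_ids):
        # Sort gallery indices from most to least similar to the query.
        ranked = sorted(
            range(len(image_embs)),
            key=lambda i: cosine(query, image_embs[i]),
            reverse=True,
        )
        if any(image_ids[i] == query_id for i in ranked[:k]):
            hits += 1
    return hits / len(text_embs)


# Toy data (assumed for illustration): three text queries, three gallery images.
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.2]]
text_ids = [0, 1, 1]
image_embs = [[0.9, 0.1], [0.8, 0.6], [0.1, 0.9]]
image_ids = [0, 1, 1]

rank1 = rank_at_k(text_embs, text_ids, image_embs, image_ids, k=1)
rank2 = rank_at_k(text_embs, text_ids, image_embs, image_ids, k=2)
```

In this toy setup the third query's nearest image belongs to a different identity, so it counts as a miss at k=1 but as a hit at k=2.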
Related papers
  • [1] Cross-Modal Feature Fusion-Based Knowledge Transfer for Text-Based Person Search
    You, Kaiyang
    Chen, Wenjing
    Wang, Chengji
    Sun, Hao
    Xie, Wei
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 2230 - 2234
  • [2] Asymmetric Cross-Scale Alignment for Text-Based Person Search
    Ji, Zhong
    Hu, Junhua
    Liu, Deyin
    Wu, Lin Yuanbo
    Zhao, Ye
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 7699 - 7709
  • [3] Prototype-guided Cross-modal Completion and Alignment for Incomplete Text-based Person Re-identification
    Gong, Tiantian
    Du, Guodong
    Wang, Junsheng
    Ding, Yongkang
    Zhang, Liyan
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 5253 - 5261
  • [4] Feature semantic alignment and information supplement for Text-based person search
    Zhou, Hang
    Li, Fan
    Tian, Xuening
    Huang, Yuling
    FRONTIERS IN PHYSICS, 2023, 11
  • [5] Joint Token and Feature Alignment Framework for Text-Based Person Search
    Li, Shangze
    Lu, Andong
    Huang, Yan
    Li, Chenglong
    Wang, Liang
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 2238 - 2242
  • [6] CLIP-Based Multi-level Alignment for Text-based Person Search
    Wu, Zhijun
    Ma, Shiwei
    2024 5TH INTERNATIONAL CONFERENCE ON COMPUTER ENGINEERING AND APPLICATION, ICCEA 2024, 2024, : 610 - 614
  • [7] Look Before You Leap: Improving Text-based Person Retrieval by Learning A Consistent Cross-modal Common Manifold
    Wang, Zijie
    Zhu, Aichun
    Xue, Jingyi
    Wan, Xili
    Liu, Chao
    Wang, Tian
    Li, Yifeng
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 1984 - 1992
  • [8] Full-view salient feature mining and alignment for text-based person search
    Xie, Sheng
    Zhang, Canlong
    Ning, Enhao
    Li, Zhixin
    Wang, Zhiwen
    Wei, Chunrong
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 251
  • [9] Text-Guided Visual Feature Refinement for Text-Based Person Search
    Gao, Liying
    Niu, Kai
    Ma, Zehong
    Jiao, Bingliang
    Tan, Tonghao
    Wang, Peng
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 118 - 126
  • [10] Fine-grained semantic oriented embedding set alignment for text-based person search
    Zhao, Jiaqi
    Fu, Ao
    Zhou, Yong
    Du, Wen-liang
    Yao, Rui
    IMAGE AND VISION COMPUTING, 2024, 152