Contrastive Transformer Learning With Proximity Data Generation for Text-Based Person Search

Cited by: 8
Authors
Wu, Hefeng [1]
Chen, Weifeng [1]
Liu, Zhibin [1]
Chen, Tianshui [2]
Chen, Zhiguang [3,4]
Lin, Liang [1]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangdong Prov Key Lab Informat Secur Technol, Guangzhou 510006, Peoples R China
[2] Guangdong Univ Technol, Sch Informat Engn, Guangzhou 510006, Peoples R China
[3] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou 510006, Peoples R China
[4] Sun Yat Sen Univ, Natl Supercomp Ctr Guangzhou, Guangzhou 510006, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Text-based person search; transformer; contrastive learning; proximity data generation;
DOI
10.1109/TCSVT.2023.3329220
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline Codes
0808; 0809;
Abstract
Given a descriptive text query, text-based person search (TBPS) aims to retrieve the best-matched target person from an image gallery. This cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences between identities, and the insufficiency of annotated data. To better align the two modalities, most existing works focus on introducing sophisticated network structures and auxiliary tasks, which are complex and hard to implement. In this paper, we propose a simple yet effective dual-Transformer model for text-based person search. By exploiting a hardness-aware contrastive learning strategy, our model achieves state-of-the-art performance without any special design for local feature alignment or side information. Moreover, we propose a proximity data generation (PDG) module to automatically produce more diverse data for cross-modal training. The PDG module first introduces an automatic generation algorithm based on a text-to-image diffusion model, which generates new text-image pair samples in the proximity space of the original ones. It then combines approximate text generation and feature-level mixup during training to further strengthen data diversity. The PDG module largely guarantees the plausibility of the generated samples, which are used directly for training without any human inspection for noise rejection. It improves the performance of our model significantly, providing a feasible solution to the data-insufficiency problem faced by such fine-grained visual-linguistic tasks. Extensive experiments on two popular TBPS datasets (i.e., CUHK-PEDES and ICFG-PEDES) show that the proposed approach clearly outperforms state-of-the-art approaches, e.g., improving Top-1, Top-5, and Top-10 accuracy by 3.88%, 4.02%, and 2.92% on CUHK-PEDES.
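The abstract summarizes the method only at a high level, and this record contains no implementation details. Purely as an illustration of two ingredients named above, the PyTorch sketch below shows one plausible form of a hardness-aware contrastive objective (negatives up-weighted in proportion to their similarity) and of feature-level mixup. The function names, the weighting scheme, and the hyperparameters (tau, beta, alpha) are assumptions made for illustration, not the authors' actual formulation.

```python
import torch
import torch.nn.functional as F

def hardness_aware_nce(sim, tau=0.07, beta=0.25):
    """One direction of an InfoNCE-style loss in which negatives are
    up-weighted according to their similarity to the anchor (hardness).

    sim: (B, B) cosine-similarity matrix; diagonal entries correspond
         to the matched (positive) image-text pairs.
    """
    B = sim.size(0)
    pos = torch.eye(B, dtype=torch.bool, device=sim.device)
    with torch.no_grad():
        # Hard negatives receive weights > 1, easy ones < 1; the
        # positives keep weight 1 so only negatives are re-scaled.
        w = torch.softmax(sim / beta, dim=1) * B
        w = w.masked_fill(pos, 1.0)
    # Adding log-weights to the logits multiplies the exp-terms by w.
    logits = sim / tau + w.log()
    targets = torch.arange(B, device=sim.device)
    return F.cross_entropy(logits, targets)

def contrastive_loss(img_feats, txt_feats, tau=0.07, beta=0.25):
    """Symmetric (image-to-text + text-to-image) hardness-aware loss."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    sim = img @ txt.t()
    return 0.5 * (hardness_aware_nce(sim, tau, beta) +
                  hardness_aware_nce(sim.t(), tau, beta))

def feature_mixup(feat_a, feat_b, alpha=0.3):
    """Feature-level mixup: convexly blend two embeddings to synthesize
    a training sample in the proximity of the original ones."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * feat_a + (1.0 - lam) * feat_b
```

Under these assumptions, synthetic pairs (whether produced by a diffusion-based generator or by feature_mixup) could simply be appended to the batch before computing contrastive_loss; again, this mirrors the abstract's description only loosely.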
Pages: 7005-7016
Page count: 12