Learning Granularity-Unified Representations for Text-to-Image Person Re-identification

Cited by: 73
Authors
Shao, Zhiyin [1 ]
Zhang, Xinyu [2 ]
Fang, Meng [3 ]
Lin, Zhifeng [1 ]
Wang, Jian [2 ]
Ding, Changxing [1 ]
Affiliations
[1] South China Univ Technol, Guangzhou, Peoples R China
[2] Baidu VIS, Beijing, Peoples R China
[3] Univ Liverpool, Liverpool, Merseyside, England
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Funding
National Natural Science Foundation of China
Keywords
Person Re-identification; Text-to-image Retrieval
DOI
10.1145/3503161.3548028
CLC number
TP39 [Computer Applications]
Subject classification codes
081203; 0835
Abstract
Text-to-image person re-identification (ReID) aims to retrieve pedestrian images of an identity of interest via textual descriptions. It is challenging due to both rich intra-modal variations and significant inter-modal gaps. Existing works usually ignore the difference in feature granularity between the two modalities, i.e., visual features are usually fine-grained while textual features are coarse, a mismatch that is largely responsible for the inter-modal gaps. In this paper, we propose an end-to-end framework based on transformers to learn granularity-unified representations for both modalities, denoted as LGUR. The LGUR framework contains two modules: a Dictionary-based Granularity Alignment (DGA) module and a Prototype-based Granularity Unification (PGU) module. In DGA, in order to align the granularities of the two modalities, we introduce a Multi-modality Shared Dictionary (MSD) to reconstruct both visual and textual features. In addition, DGA incorporates two important designs, i.e., cross-modality guidance and foreground-centric reconstruction, to facilitate the optimization of the MSD. In PGU, we adopt a set of shared and learnable prototypes as queries to extract diverse and semantically aligned features for both modalities in the granularity-unified feature space, which further promotes ReID performance. Comprehensive experiments show that our LGUR consistently outperforms state-of-the-art methods by large margins on both the CUHK-PEDES and ICFG-PEDES datasets. Code will be released at https://github.com/ZhiyinShao-H/LGUR.
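The abstract describes the shared-dictionary reconstruction in DGA and the prototype queries in PGU only at a high level. Below is a minimal, hypothetical sketch of these two ideas using standard attention operations; all module names, dimensions, and hyperparameters are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedDictionaryReconstruction(nn.Module):
    # Hypothetical sketch of DGA's Multi-modality Shared Dictionary (MSD):
    # features from either modality are rebuilt as attention-weighted sums of
    # shared dictionary atoms, pulling both modalities toward one granularity.
    def __init__(self, num_atoms=400, dim=384):
        super().__init__()
        self.atoms = nn.Parameter(torch.randn(num_atoms, dim) * 0.02)

    def forward(self, feats):
        # feats: (batch, seq_len, dim), visual or textual token features.
        scores = feats @ self.atoms.t() / feats.size(-1) ** 0.5
        attn = F.softmax(scores, dim=-1)
        return attn @ self.atoms  # reconstructed, granularity-aligned features

class PrototypeGranularityUnification(nn.Module):
    # Hypothetical sketch of PGU: shared, learnable prototypes act as
    # cross-attention queries that extract a fixed set of semantically
    # aligned feature vectors from each modality.
    def __init__(self, num_prototypes=6, dim=384, num_heads=4):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats):
        # The same prototypes query both modalities, so the k-th output slot
        # is comparable across images and texts.
        q = self.prototypes.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, feats, feats)
        return out  # (batch, num_prototypes, dim)

In a full pipeline, outputs like these would be trained with ranking and identification losses so that matched image-text pairs lie close together in the unified feature space; the specific losses and dimensions used by LGUR are given in the paper, not here.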
Pages: 5566-5574
Number of pages: 9