Learning Granularity-Unified Representations for Text-to-Image Person Re-identification

Cited by: 94
Authors
Shao, Zhiyin [1 ]
Zhang, Xinyu [2 ]
Fang, Meng [3 ]
Lin, Zhifeng [1 ]
Wang, Jian [2 ]
Ding, Changxing [1 ]
Affiliations
[1] South China Univ Technol, Guangzhou, Peoples R China
[2] Baidu VIS, Beijing, Peoples R China
[3] Univ Liverpool, Liverpool, Merseyside, England
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Funding
National Natural Science Foundation of China
Keywords
Person Re-identification; Text-to-image Retrieval;
DOI
10.1145/3503161.3548028
CLC Number
TP39 [Computer Applications]
Subject Classification Codes
081203; 0835
Abstract
Text-to-image person re-identification (ReID) aims to retrieve pedestrian images of an identity of interest via textual descriptions. It is challenging due to both rich intra-modal variations and significant inter-modal gaps. Existing works usually ignore the difference in feature granularity between the two modalities, i.e., visual features are typically fine-grained while textual features are coarse, which is largely responsible for the large inter-modal gaps. In this paper, we propose an end-to-end transformer-based framework that learns granularity-unified representations for both modalities, denoted LGUR. The LGUR framework contains two modules: a Dictionary-based Granularity Alignment (DGA) module and a Prototype-based Granularity Unification (PGU) module. In DGA, to align the granularities of the two modalities, we introduce a Multi-modality Shared Dictionary (MSD) to reconstruct both visual and textual features. Moreover, DGA incorporates two important factors, i.e., cross-modality guidance and foreground-centric reconstruction, to facilitate the optimization of the MSD. In PGU, we adopt a set of shared and learnable prototypes as queries to extract diverse and semantically aligned features for both modalities in the granularity-unified feature space, which further improves ReID performance. Comprehensive experiments show that LGUR consistently outperforms state-of-the-art methods by large margins on both the CUHK-PEDES and ICFG-PEDES datasets. Code will be released at https://github.com/ZhiyinShao-H/LGUR.
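The two modules described in the abstract can both be read as forms of cross-attention, and a rough sketch may help make the granularity-unification idea concrete. The following PyTorch snippet is only an illustrative assumption, not the authors' released implementation: the embedding dimension, dictionary size, prototype count, and class names are hypothetical, and the actual LGUR additionally uses cross-modality guidance, foreground-centric reconstruction, and ReID losses on top of this skeleton.

# Minimal sketch (not the authors' code). DGA is approximated as reconstructing
# token features from a modality-shared dictionary; PGU as pooling them with a
# fixed set of shared, learnable prototype queries. All sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedDictionaryReconstruction(nn.Module):
    """DGA-style idea: rebuild each token from atoms of a shared dictionary."""
    def __init__(self, dim=384, num_atoms=400):
        super().__init__()
        self.dictionary = nn.Parameter(torch.randn(num_atoms, dim) * 0.02)

    def forward(self, feats):                       # feats: (B, N, D)
        scale = feats.size(-1) ** 0.5
        attn = F.softmax(feats @ self.dictionary.t() / scale, dim=-1)  # (B, N, A)
        return attn @ self.dictionary               # reconstructed feats: (B, N, D)

class PrototypeUnification(nn.Module):
    """PGU-style idea: shared learnable prototypes act as queries over either modality."""
    def __init__(self, dim=384, num_prototypes=6, num_heads=4):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats):                       # feats: (B, N, D)
        q = self.prototypes.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.attn(q, feats, feats)         # (B, P, D) prototype-aligned parts
        return out

if __name__ == "__main__":
    # Toy usage on random "visual" and "textual" token features of different lengths.
    dga, pgu = SharedDictionaryReconstruction(), PrototypeUnification()
    vis, txt = torch.randn(2, 48, 384), torch.randn(2, 64, 384)
    vis_parts, txt_parts = pgu(dga(vis)), pgu(dga(txt))
    print(vis_parts.shape, txt_parts.shape)         # both torch.Size([2, 6, 384])

Because the dictionary and the prototypes are shared across modalities, both branches end up expressed in the same vocabulary of parts, which is the intuition behind the "granularity-unified" feature space; matching losses would then be applied on the prototype-aligned outputs.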
Pages: 5566-5574
Number of Pages: 9