Spatial-Channel Enhanced Transformer for Visible-Infrared Person Re-Identification

Cited by: 65
Authors
Zhao, Jiaqi [1 ]
Wang, Hanzheng [2 ]
Zhou, Yong [2 ]
Yao, Rui [2 ]
Chen, Silin [2 ]
El Saddik, Abdulmotaleb [3]
Affiliations
[1] China Univ Min & Technol, Innovat Res Ctr Disaster Intelligent Prevent & Eme, Sch Comp Sci & Technol, Xuzhou 221116, Peoples R China
[2] China Univ Min & Technol, Sch Comp Sci & Technol, Xuzhou 221116, Jiangsu, Peoples R China
[3] Univ Ottawa, Sch Elect Engn & Comp Sci, Ottawa, ON K1N 5N6, Canada
Funding
National Natural Science Foundation of China
Keywords
Cross-modality person re-identification; visual Transformer; image retrieval; deep learning
DOI
10.1109/TMM.2022.3163847
CLC Number
TP [automation technology; computer technology]
Discipline Code
0812
Abstract
Visible-infrared person re-identification (VI-ReID) is a challenging computer vision task that aims to match people across images from the visible and infrared modalities. The widely used VI-ReID framework consists of a convolutional backbone network that extracts visual features and a feature embedding network that projects the heterogeneous features into a shared feature space. However, many studies built on existing pre-trained models neglect potential correlations between different locations and channels within a single sample during feature extraction. Inspired by the success of the Transformer in computer vision, we extend it to enhance feature representation for VI-ReID. In this paper, we propose a discriminative feature learning network based on a visual Transformer (DFLN-ViT) for VI-ReID. First, to capture long-range dependencies between different locations, we propose a spatial feature awareness module (SAM), which uses a single-layer Transformer with a novel patch-embedding strategy to encode location information. Second, to refine the representation of each channel, we design a channel feature enhancement module (CEM). The CEM treats the features of each channel as a token in the Transformer input sequence, exploiting the Transformer's ability to model long-range dependencies. Finally, we propose a Triplet-aided Hetero-Center (THC) loss that learns more discriminative feature representations by balancing the cross-modality and intra-modality distances between feature centers. Experimental results on two datasets show that our method significantly improves VI-ReID performance and outperforms most state-of-the-art methods.
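A minimal PyTorch sketch of the two module ideas described above, for illustration only: this is not the authors' code, the paper's novel patch-embedding strategy is not reproduced, and all class names, head counts, and the choice of nn.TransformerEncoderLayer are assumptions. SAM-style mixing treats each spatial location of a CNN feature map as a token, while CEM-style mixing treats each channel as a token whose embedding is its flattened spatial map.

import torch
import torch.nn as nn

class SpatialTokenMixer(nn.Module):
    # SAM-style: one token per spatial location, embedding size C.
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True)

    def forward(self, x):                         # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)     # (B, H*W, C)
        return self.encoder(tokens).transpose(1, 2).reshape(b, c, h, w)

class ChannelTokenMixer(nn.Module):
    # CEM-style: one token per channel, embedding size H*W.
    def __init__(self, h, w, num_heads=1):        # h*w must be divisible by num_heads
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=h * w, nhead=num_heads, batch_first=True)

    def forward(self, x):                         # x: (B, C, H, W)
        b, c, h, w = x.shape
        return self.encoder(x.flatten(2)).reshape(b, c, h, w)

x = torch.randn(2, 256, 12, 6)                    # e.g. a backbone stage output
y = ChannelTokenMixer(12, 6)(SpatialTokenMixer(256)(x))   # shape preserved: (2, 256, 12, 6)

Both mixers preserve the feature-map shape, so they can be inserted between backbone stages; the channel variant is what lets attention model dependencies across channels rather than across positions.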
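The THC loss is described only at a high level in the abstract; the following is a plausible sketch of a center-based triplet objective consistent with that description, assuming per-identity modality centers and hardest-negative mining with a hinge margin. The function name, margin value, and mining rule are assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def thc_loss(vis_feats, ir_feats, labels, margin=0.3):
    # One visible center and one infrared center per identity.
    ids = labels.unique()
    cv = torch.stack([vis_feats[labels == i].mean(0) for i in ids])
    ci = torch.stack([ir_feats[labels == i].mean(0) for i in ids])
    pos = F.pairwise_distance(cv, ci)             # cross-modality distance, same identity
    dist = torch.cdist(cv, ci)                    # all cross-modality center pairs
    dist.fill_diagonal_(float('inf'))             # mask out same-identity pairs
    neg = torch.minimum(dist.min(1).values, dist.min(0).values)   # hardest wrong identity
    return F.relu(pos - neg + margin).mean()      # triplet-style hinge on the centers

labels = torch.arange(4).repeat_interleave(2)     # 4 identities, 2 samples each per modality
loss = thc_loss(torch.randn(8, 64), torch.randn(8, 64), labels)

One motivation for center-based losses of this kind is that each identity contributes exactly one positive pair per batch, independent of how many images it has, which keeps the distance terms balanced.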
Pages: 3668-3680
Number of pages: 13