Spatial-Channel Enhanced Transformer for Visible-Infrared Person Re-Identification

Cited by: 65
Authors
Zhao, Jiaqi [1 ]
Wang, Hanzheng [2 ]
Zhou, Yong [2 ]
Yao, Rui [2 ]
Chen, Silin [2 ]
El Saddik, Abdulmotaleb [3]
Affiliations
[1] China Univ Min & Technol, Innovat Res Ctr Disaster Intelligent Prevent & Eme, Sch Comp Sci & Technol, Xuzhou 221116, Peoples R China
[2] China Univ Min & Technol, Sch Comp Sci & Technol, Xuzhou 221116, Jiangsu, Peoples R China
[3] Univ Ottawa, Sch Elect Engn & Comp Sci, Ottawa, ON K1N 5N6, Canada
Funding
National Natural Science Foundation of China
Keywords
Cross-modality person re-identification; visual Transformer; image retrieval; deep learning
DOI
10.1109/TMM.2022.3163847
CLC Number
TP [automation technology; computer technology]
Discipline Code
0812
Abstract
Visible-infrared person re-identification (VI-ReID) is a challenging computer vision task that aims to match people across images from the visible and infrared modalities. The widely used VI-ReID framework consists of a convolutional backbone network that extracts visual features and a feature embedding network that projects the heterogeneous features into a shared feature space. However, many studies built on existing pre-trained models neglect potential correlations between different locations and channels within a single sample during feature extraction. Inspired by the success of the Transformer in computer vision, we extend it to enhance feature representation for VI-ReID. In this paper, we propose a discriminative feature learning network based on a visual Transformer (DFLN-ViT) for VI-ReID. First, to capture long-range dependencies between different locations, we propose a spatial feature awareness module (SAM), which uses a single-layer Transformer with a novel patch-embedding strategy to encode location information. Second, to refine the representation of each channel, we design a channel feature enhancement module (CEM). The CEM treats the features of each channel as a token in the Transformer input sequence, exploiting the Transformer's ability to model long-range dependencies. Finally, we propose a Triplet-aided Hetero-Center (THC) loss that learns more discriminative feature representations by balancing the cross-modality and intra-modality distances between feature centers. Experimental results on two datasets show that our method significantly improves VI-ReID performance and outperforms most state-of-the-art methods.
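A minimal PyTorch sketch of the two module ideas described above, for illustration only: this is not the authors' code, the paper's novel patch-embedding strategy is not reproduced, and all class names, head counts, and the choice of nn.TransformerEncoderLayer are assumptions. SAM-style mixing treats each spatial location of a CNN feature map as a token, while CEM-style mixing treats each channel as a token whose embedding is its flattened spatial map.

import torch
import torch.nn as nn

class SpatialTokenMixer(nn.Module):
    # SAM-style: one token per spatial location, embedding size C.
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True)

    def forward(self, x):                         # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)     # (B, H*W, C)
        return self.encoder(tokens).transpose(1, 2).reshape(b, c, h, w)

class ChannelTokenMixer(nn.Module):
    # CEM-style: one token per channel, embedding size H*W.
    def __init__(self, h, w, num_heads=1):        # h*w must be divisible by num_heads
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=h * w, nhead=num_heads, batch_first=True)

    def forward(self, x):                         # x: (B, C, H, W)
        b, c, h, w = x.shape
        return self.encoder(x.flatten(2)).reshape(b, c, h, w)

x = torch.randn(2, 256, 12, 6)                    # e.g. a backbone stage output
y = ChannelTokenMixer(12, 6)(SpatialTokenMixer(256)(x))   # shape preserved: (2, 256, 12, 6)

Both mixers preserve the feature-map shape, so they can be inserted between backbone stages; the channel variant is what lets attention model dependencies across channels rather than across positions.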
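The THC loss is described only at a high level in the abstract; the following is a plausible sketch of a center-based triplet objective consistent with that description, assuming per-identity modality centers and hardest-negative mining with a hinge margin. The function name, margin value, and mining rule are assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def thc_loss(vis_feats, ir_feats, labels, margin=0.3):
    # One visible center and one infrared center per identity.
    ids = labels.unique()
    cv = torch.stack([vis_feats[labels == i].mean(0) for i in ids])
    ci = torch.stack([ir_feats[labels == i].mean(0) for i in ids])
    pos = F.pairwise_distance(cv, ci)             # cross-modality distance, same identity
    dist = torch.cdist(cv, ci)                    # all cross-modality center pairs
    dist.fill_diagonal_(float('inf'))             # mask out same-identity pairs
    neg = torch.minimum(dist.min(1).values, dist.min(0).values)   # hardest wrong identity
    return F.relu(pos - neg + margin).mean()      # triplet-style hinge on the centers

labels = torch.arange(4).repeat_interleave(2)     # 4 identities, 2 samples each per modality
loss = thc_loss(torch.randn(8, 64), torch.randn(8, 64), labels)

One motivation for center-based losses of this kind is that each identity contributes exactly one positive pair per batch, independent of how many images it has, which keeps the distance terms balanced.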
Pages: 3668-3680
Number of pages: 13