HASI: Hierarchical Attention-Aware Spatio-Temporal Interaction for Video-Based Person Re-Identification

Cited by: 2
Authors
Chen, Si [1, 2]
Da, Hui [1]
Wang, Da-Han [1]
Zhang, Xu-Yao [3, 4]
Yan, Yan [5]
Zhu, Shunzhi [1]
Affiliations
[1] Xiamen Univ Technol, Sch Comp & Informat Engn, Fujian Key Lab Pattern Recognit & Image Understan, Xiamen 361024, Peoples R China
[2] Xidian Univ, State Key Lab Integrated Serv Networks, Xian 710071, Peoples R China
[3] Chinese Acad Sci, Inst Automat, Beijing 100190, Peoples R China
[4] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China
[5] Xiamen Univ, Sch Informat, Xiamen 361005, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Video-based person re-identification; vision transformer; spatio-temporal interaction; deep feature fusion; TRANSFORMER; NETWORK; ENHANCEMENT
DOI
10.1109/TCSVT.2023.3340428
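Chinese Library Classification number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification code
0808; 0809
Abstract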
Video-based person re-identification (re-ID) aims to match video sequences of the same pedestrian across non-overlapping cameras. Video re-ID methods generally extract features frame by frame, but they still lack effective spatio-temporal interaction, which easily leads to the multi-frame misalignment problem. In this paper, we propose a Hierarchical Attention-aware Spatio-temporal Interaction (HASI) network for video-based person re-ID, comprising an Attention-aware Temporal Interaction (ATI) module and a Hierarchical Local-spatial Enhancement (HLE) module. To avoid spatial misalignment between video frames, the ATI module employs multiple Frame-to-Frame Temporal Interaction (2FTI) blocks with Multi-head Inter-frame Alignment Attention (MIAA), so that the current frame iteratively interacts with each of the remaining frames of a video in a positive single-cycle manner, rather than interacting only with the adjacent frame or directly building the relationships among all frames at once. This module not only captures long-range, non-adjacent temporal information but also learns pairwise frame-to-frame relationships. Moreover, the HLE module is designed to enhance local fine-grained features from multiple Transformer layers while delivering low-level information to further enrich middle-level and high-level semantic knowledge. Thus, our method learns multi-perspective pedestrian information, including inter-frame long-range interaction information and intra-frame multi-layer global and local information. Extensive experiments demonstrate the superiority of the proposed HASI method over state-of-the-art methods on three challenging video-based re-ID datasets, i.e., MARS, iLIDS-VID, and PRID-2011.
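The frame-to-frame interaction described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: it uses plain single-head dot-product attention without learned projections in place of MIAA, and it assumes that "positive single-cycle" means each frame visits frames t+1, t+2, ... in forward order, wrapping around once. The names `cross_frame_attention`, `cyclic_order`, and `temporal_interaction` are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_frame_attention(query_tokens, other_tokens):
    """Each token of the current frame attends to all tokens of another frame.

    Simplification: single head, no learned query/key/value projections.
    """
    d = len(query_tokens[0])
    attended = []
    for q in query_tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in other_tokens]
        weights = softmax(scores)
        attended.append([sum(w * v[j] for w, v in zip(weights, other_tokens))
                         for j in range(d)])
    return attended


def cyclic_order(t, num_frames):
    """Assumed visiting order for frame t: t+1, t+2, ..., wrapping around once."""
    return [(t + k) % num_frames for k in range(1, num_frames)]


def temporal_interaction(frames):
    """frames[t][n] is the feature vector of token n in frame t."""
    num_frames = len(frames)
    fused = []
    for t in range(num_frames):
        x = [list(tok) for tok in frames[t]]
        for s in cyclic_order(t, num_frames):
            # Pairwise step: the current frame interacts with one other frame,
            # and the result is added back as a residual update.
            update = cross_frame_attention(x, frames[s])
            x = [[a + b for a, b in zip(tok, upd)]
                 for tok, upd in zip(x, update)]
        fused.append(x)
    return fused


rng = random.Random(0)
# Toy clip: 4 frames, 6 tokens per frame, 8 feature dimensions.
video = [[[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(6)]
         for _ in range(4)]
fused = temporal_interaction(video)
print(len(fused), len(fused[0]), len(fused[0][0]))
print(cyclic_order(2, 4))
```

In this sketch every frame interacts with each other frame exactly once, one pair at a time, which contrasts with attending only to the adjacent frame or to all frames jointly in a single attention call.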
Pages: 4973-4988
Number of pages: 16