HASI: Hierarchical Attention-Aware Spatio-Temporal Interaction for Video-Based Person Re-Identification

Times cited: 2

Authors:

Chen, Si [1,2]

Da, Hui [1]

Wang, Da-Han [1]

Zhang, Xu-Yao [3,4]

Yan, Yan [5]

Zhu, Shunzhi [1]

Affiliations:

[1] Xiamen Univ Technol, Sch Comp & Informat Engn, Fujian Key Lab Pattern Recognit & Image Understan, Xiamen 361024, Peoples R China

[2] Xidian Univ, State Key Lab Integrated Serv Networks, Xian 710071, Peoples R China

[3] Chinese Acad Sci, Inst Automat, Beijing 100190, Peoples R China

[4] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China

[5] Xiamen Univ, Sch Informat, Xiamen 361005, Peoples R China

Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2024 / Vol. 34 / No. 06

Funding:

National Natural Science Foundation of China;

Keywords:

Video-based person re-identification; vision transformer; spatio-temporal interaction; deep feature fusion; TRANSFORMER; NETWORK; ENHANCEMENT;

DOI:

10.1109/TCSVT.2023.3340428

Chinese Library Classification (CLC) number:

TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];

Discipline classification code:

0808; 0809;

Abstract:

Video-based person re-identification (re-ID) aims to match video sequences of the same pedestrian across non-overlapping cameras. Video re-ID methods generally extract features frame by frame, but they still lack effective spatio-temporal interaction, which easily leads to the multi-frame misalignment problem. In this paper, we propose a Hierarchical Attention-aware Spatio-temporal Interaction (HASI) network for video-based person re-ID, comprising an Attention-aware Temporal Interaction (ATI) module and a Hierarchical Local-spatial Enhancement (HLE) module. To avoid spatial misalignment between video frames, the ATI module employs multiple Frame-to-Frame Temporal Interaction (2FTI) blocks with Multi-head Inter-frame Alignment Attention (MIAA), so that the current frame iteratively interacts with each remaining frame of a video in a positive single-cycle manner, rather than interacting only with the adjacent frame or directly building the relationship among all frames at once. This module not only captures long-range, non-adjacent temporal information but also learns pairwise frame-to-frame relationships. Moreover, the HLE module is designed to enhance local fine-grained features from multiple Transformer layers, while delivering low-level information to further enrich middle-level and high-level semantic knowledge. Thus, our method can learn multi-perspective pedestrian information, including inter-frame long-range interaction information and intra-frame multi-layer global and local information. Extensive experiments demonstrate the superiority of the proposed HASI method over state-of-the-art methods on three challenging video-based re-ID datasets, i.e., MARS, iLIDS-VID, and PRID-2011.

Pages: 4973-4988

Page count: 16
