Deformable patch embedding-based shift module-enhanced transformer for panoramic action recognition

Cited by: 1
Authors
Zhang, Xiaoyan [1, 2]
Cui, Yujie [1, 2]
Huo, Yongkai [1, 2]
Affiliations
[1] Shenzhen Univ, Sch Comp Sci & Software Engn, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
[2] Shenzhen Univ, Res Inst Future Media Comp, Sch Comp Sci & Software Engn, Shenzhen 518060, Peoples R China
Funding
National Science Foundation (USA); National Natural Science Foundation of China
Keywords
Panoramic; Action recognition; Vision transformer; Temporal shift
DOI
10.1007/s00371-023-02959-y
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
360° video action recognition is one of the most promising fields given the popularity of omnidirectional cameras. To obtain a more precise understanding of actions in panoramic scenes, we propose a deformable patch embedding-based temporal shift module-enhanced vision transformer model (DS-ViT), which aims to simultaneously eliminate the distortion effects caused by equirectangular projection (ERP) and construct temporal relationships across the video sequence. Panoramic action recognition is a practical but challenging domain because panoramic feature extraction methods are lacking. With deformable patch embedding, our scheme adaptively learns the position offsets between different pixels, which effectively captures the distorted features. The temporal shift module facilitates the exchange of temporal information by shifting part of the channels, adding zero parameters. Thanks to the powerful encoder, DS-ViT can efficiently learn the distorted features from the ERP inputs. Simulation results show that our proposed solution outperforms the state-of-the-art two-stream solution by 9.29% in action accuracy and 8.18% in activity accuracy on the recent EgoK360 dataset.
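The parameter-free channel shift mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common bidirectional variant of temporal shifting, in which one fraction of the channels takes its values from the next frame, another fraction from the previous frame, and the rest stay in place. The function name `temporal_shift`, the `fold_div` ratio, and the zero padding at the sequence ends are illustrative assumptions, not the configuration confirmed for DS-ViT.

```python
def temporal_shift(frames, fold_div=4):
    """Shift part of the channels along the time axis (no learned parameters).

    frames: list of T frames, each a list of C channel values.
    fold_div: hypothetical ratio; C // fold_div channels are shifted in
              each temporal direction, the remaining channels are untouched.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div

    # Start from zeros so positions with no temporal neighbour stay zero-padded.
    shifted = [[0] * num_channels for _ in range(num_frames)]

    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1      # first group: read from the next frame
            elif c < 2 * fold:
                src = t - 1      # second group: read from the previous frame
            else:
                src = t          # remaining channels: unchanged
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
    return shifted


# Three frames with four channels each; with fold_div=4 one channel moves
# in each direction and two channels stay where they are.
clip = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
print(temporal_shift(clip))
```

Because the operation only moves existing values between neighbouring frames, it mixes temporal information without adding any weights, which is the property the abstract refers to as "zero parameters".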
Pages: 3247-3257
Number of pages: 11