Deformable patch embedding-based shift module-enhanced transformer for panoramic action recognition

Cited by: 0
Authors
Xiaoyan Zhang
Yujie Cui
Yongkai Huo
Affiliations
[1] Shenzhen University, National Engineering Laboratory for Big Data System Computing Technology, and the Research Institute for Future Media Computing, School of Computer Science and Software Engineering
Source
The Visual Computer | 2023, Vol. 39
Keywords
Panoramic; Action recognition; Vision transformer; Temporal shift
DOI
Not available
Abstract
360° video action recognition is one of the most promising fields given the growing popularity of omnidirectional cameras. To obtain a more precise understanding of actions in panoramic scenes, we propose a deformable patch embedding-based, temporal shift module-enhanced vision transformer model (DS-ViT), which aims to simultaneously eliminate the distortion effects caused by equirectangular projection (ERP) and model temporal relationships across the video sequence. Panoramic action recognition is a practical but challenging domain owing to the lack of panoramic feature extraction methods. With deformable patch embedding, our scheme adaptively learns the position offsets between different pixels, which effectively captures distorted features. The temporal shift module facilitates temporal information exchange by shifting part of the channels, adding zero parameters. Thanks to the powerful encoder, DS-ViT can efficiently learn distorted features from ERP inputs. Simulation results on the recent EgoK360 dataset show that our proposed solution outperforms the state-of-the-art two-stream solution by 9.29% in action accuracy and 8.18% in activity accuracy.
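The abstract describes two mechanisms: a patch embedding whose sampling positions are warped by learned offsets (to counter ERP distortion), and a parameter-free temporal shift that exchanges a fraction of the channels between neighbouring frames. The following is a minimal PyTorch sketch of those two operations only; the module names, the per-pixel offset parameterization, and the shift fraction (shift_div=8) are illustrative assumptions and not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformablePatchEmbed(nn.Module):
    """Patch embedding whose sampling grid is warped by learned offsets."""

    def __init__(self, in_ch=3, embed_dim=768, patch=16):
        super().__init__()
        # Predicts a 2-D sampling offset for every pixel of the ERP frame.
        self.offset = nn.Conv2d(in_ch, 2, kernel_size=3, padding=1)
        # Standard ViT-style patch projection applied to the warped frame.
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):                              # x: (B, C, H, W)
        B, _, H, W = x.shape
        offsets = 0.1 * self.offset(x).tanh()          # small warp, (B, 2, H, W)
        ys = torch.linspace(-1, 1, H, device=x.device)
        xs = torch.linspace(-1, 1, W, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
        grid = base + offsets.permute(0, 2, 3, 1)      # (B, H, W, 2), (x, y) order
        warped = F.grid_sample(x, grid, align_corners=True)
        tokens = self.proj(warped)                     # (B, D, H/p, W/p)
        return tokens.flatten(2).transpose(1, 2)       # (B, N, D) patch tokens


def temporal_shift(x, n_frames, shift_div=8):
    """Parameter-free channel shift across time, applied to patch tokens.

    x: (B*T, N, D) tokens for T frames of each clip; 2/shift_div of the
    channels are exchanged with the neighbouring frames.
    """
    bt, n, d = x.shape
    x = x.view(bt // n_frames, n_frames, n, d)
    fold = d // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :, :fold] = x[:, :-1, :, :fold]                  # from frame t-1
    out[:, :-1, :, fold:2 * fold] = x[:, 1:, :, fold:2 * fold]  # from frame t+1
    out[:, :, :, 2 * fold:] = x[:, :, :, 2 * fold:]             # rest unchanged
    return out.view(bt, n, d)


if __name__ == "__main__":
    frames = torch.randn(2 * 8, 3, 224, 448)           # 2 clips x 8 ERP frames
    embed = DeformablePatchEmbed()
    tokens = temporal_shift(embed(frames), n_frames=8)
    print(tokens.shape)                                 # (16, 392, 768)
```

In the paper, these operations are integrated inside the transformer encoder pipeline; the standalone functions above only illustrate how learned sampling offsets and a zero-parameter channel shift can be realised.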
Pages: 3247-3257
Page count: 10