An Effective Video Transformer With Synchronized Spatiotemporal and Spatial Self-Attention for Action Recognition

Cited by: 35
Authors
Alfasly, Saghir [1 ,2 ]
Chui, Charles K. [3 ]
Jiang, Qingtang [4 ]
Lu, Jian [1 ,5 ]
Xu, Chen [1 ,2 ]
Affiliations
[1] Shenzhen Univ, Coll Math & Stat, Shenzhen Key Lab Adv Machine Learning & Applicat, Shenzhen 518060, Peoples R China
[2] Guangdong Key Lab Intelligent Informat Proc, Shenzhen 518060, Peoples R China
[3] Stanford Univ, Dept Stat, Stanford, CA 94305 USA
[4] Univ Missouri, Dept Math & Stat, St Louis, MO 63121 USA
[5] Natl Ctr Appl Math Shenzhen NCAMS, Shenzhen 518055, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Training; Spatiotemporal phenomena; Adaptation models; Image recognition; Computational modeling; Synchronization; Action recognition; frame interlacing; motion spotlighting; video augmentation; video transformers (VidTrs);
DOI
10.1109/TNNLS.2022.3190367
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Convolutional neural networks (CNNs) have dominated vision-based deep neural network architectures for both image and video models over the past decade. However, convolution-free vision Transformers (ViTs) have recently outperformed CNN-based models in image recognition. Despite this progress, designing and building video Transformers (VidTrs) has not yet received the same research attention as image-based Transformers. Although VidTrs have been built by adapting image-based Transformers for video understanding, they still lack efficiency owing to the large gap between CNN-based models and Transformers in terms of parameter counts and training settings. In this work, we propose three techniques to improve video understanding with VidTrs. First, to derive better spatiotemporal feature representations, we propose a new spatiotemporal attention scheme, termed synchronized spatiotemporal and spatial attention (SSTSA), which derives spatiotemporal features with temporal and spatial multiheaded self-attention (MSA) modules. It also preserves the best spatial attention through an additional spatial self-attention module running in parallel, resulting in an effective Transformer encoder. Second, a motion spotlighting module is proposed to embed the short-term motion of consecutive input frames into the regular RGB input, which is then processed by a single-stream VidTr. Third, a simple intraclass frame-interlacing method for the input clips is proposed, which serves as an effective video augmentation. Finally, the proposed techniques are evaluated and validated through extensive experiments. Our VidTr outperforms its previous counterparts on two well-known datasets, Kinetics400 and Something-Something-v2.
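To make the SSTSA idea in the abstract concrete, the following is a minimal PyTorch sketch of an encoder block with a temporal-then-spatial attention path and a parallel spatial-only attention branch. The class name, token layout (batch, frames, spatial tokens, embedding dim), and the way the two branches are fused are assumptions for illustration only, not the authors' implementation.

```python
# Sketch of an SSTSA-style encoder block (hypothetical; layer names and
# branch fusion are assumptions based only on the abstract).
import torch
import torch.nn as nn


class SSTSABlock(nn.Module):
    """Synchronized spatiotemporal and spatial self-attention (sketch)."""

    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Branch 1: temporal MSA followed by spatial MSA (spatiotemporal path).
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Branch 2: parallel spatial-only MSA that preserves pure spatial attention.
        self.parallel_spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) -- batch, frames, spatial tokens, embedding dim.
        B, T, N, D = x.shape
        h = self.norm1(x)

        # Temporal attention: attend across frames at each spatial location.
        t_in = h.permute(0, 2, 1, 3).reshape(B * N, T, D)
        t_out, _ = self.temporal_attn(t_in, t_in, t_in)
        t_out = t_out.reshape(B, N, T, D).permute(0, 2, 1, 3)

        # Spatial attention on the temporally attended tokens (per frame).
        s_in = t_out.reshape(B * T, N, D)
        st_out, _ = self.spatial_attn(s_in, s_in, s_in)
        st_out = st_out.reshape(B, T, N, D)

        # Parallel spatial-only attention on the original (normalized) tokens.
        p_in = h.reshape(B * T, N, D)
        p_out, _ = self.parallel_spatial_attn(p_in, p_in, p_in)
        p_out = p_out.reshape(B, T, N, D)

        # Combine both branches with a residual connection (simple sum here;
        # the actual fusion in the paper may differ).
        x = x + st_out + p_out
        x = x + self.mlp(self.norm2(x))
        return x


if __name__ == "__main__":
    block = SSTSABlock(dim=192, num_heads=4)
    clip_tokens = torch.randn(2, 8, 196, 192)   # 2 clips, 8 frames, 14x14 patches
    print(block(clip_tokens).shape)              # torch.Size([2, 8, 196, 192])
```

The key design point illustrated here is that the spatial-only branch operates on the same normalized tokens as the spatiotemporal branch, so pure spatial attention is retained alongside the synchronized temporal-spatial path before both are merged into the residual stream.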
Pages: 2496-2509
Number of pages: 14