Shifted Chunk Transformer for Spatio-Temporal Representational Learning

Cited by: 0
Authors
Zha, Xuefan [1 ]
Zhu, Wentao [1 ]
Lv, Tingxun [1 ]
Yang, Sen [1 ]
Liu, Ji [1 ]
Affiliations
[1] Kuaishou Technol, Beijing, Peoples R China
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021) | 2021, Vol. 34
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Spatio-temporal representational learning has been widely adopted in various fields such as action recognition, video object segmentation, and action anticipation. Previous spatio-temporal representational learning approaches primarily employ ConvNets or sequential models, e.g., LSTM, to learn the intra-frame and inter-frame features. Recently, Transformer models have successfully dominated the study of natural language processing (NLP), image classification, etc. However, pure-Transformer-based spatio-temporal learning can be prohibitively costly in memory and computation when extracting fine-grained features from a tiny patch. To tackle the training difficulty and enhance spatio-temporal learning, we construct a shifted chunk Transformer with pure self-attention blocks. Leveraging recent efficient Transformer designs in NLP, this shifted chunk Transformer can learn hierarchical spatio-temporal features from a local tiny patch to a global video clip. Our shifted self-attention can also effectively model complicated inter-frame variances. Furthermore, we build a clip encoder based on Transformer to model long-term temporal dependencies. We conduct thorough ablation studies to validate each component and the hyper-parameters of our shifted chunk Transformer, and it outperforms previous state-of-the-art approaches on Kinetics-400, Kinetics-600, UCF101, and HMDB51.
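As a rough illustration of the pipeline the abstract outlines (per-frame chunk self-attention, a shifted self-attention step that mixes each frame's tokens with those of a neighboring frame, and a clip-level Transformer encoder for long-term temporal dependencies), a minimal PyTorch sketch follows. The module names, layer sizes, and the exact shifting rule are assumptions made for illustration, not the authors' implementation.

# Minimal sketch, assuming: 16x16 patch embedding, one joint block of
# spatial + shifted attention, and a 2-layer clip encoder. Not the paper's code.
import torch
import torch.nn as nn

class ShiftedChunkBlock(nn.Module):
    def __init__(self, dim=96, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.shift_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, frames, patches, dim)
        b, t, p, d = x.shape
        # 1) Self-attention among patch tokens within each frame ("image chunk").
        tokens = x.reshape(b * t, p, d)
        q = self.norm1(tokens)
        tokens = tokens + self.spatial_attn(q, q, q)[0]
        x = tokens.reshape(b, t, p, d)
        # 2) Shifted self-attention: each frame's patches query the previous
        #    frame's patches (assumed wrap-around shift of one frame).
        prev = torch.roll(x, shifts=1, dims=1).reshape(b * t, p, d)
        cur = x.reshape(b * t, p, d)
        cur = cur + self.shift_attn(self.norm2(cur), self.norm2(prev), self.norm2(prev))[0]
        return cur.reshape(b, t, p, d)

class TinyShiftedChunkTransformer(nn.Module):
    def __init__(self, dim=96, heads=4, num_classes=400, patch=16):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # frame -> patch tokens
        self.block = ShiftedChunkBlock(dim, heads)
        clip_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.clip_encoder = nn.TransformerEncoder(clip_layer, num_layers=2)  # long-term temporal model
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):
        # video: (batch, frames, 3, H, W)
        b, t, c, h, w = video.shape
        x = self.embed(video.reshape(b * t, c, h, w))   # (b*t, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)                # (b*t, patches, dim)
        x = x.reshape(b, t, x.shape[1], -1)
        x = self.block(x)                               # spatial + shifted attention
        frame_emb = x.mean(dim=2)                       # (b, frames, dim) per-frame embedding
        clip = self.clip_encoder(frame_emb)             # model inter-frame dependencies
        return self.head(clip.mean(dim=1))              # clip-level class logits

if __name__ == "__main__":
    model = TinyShiftedChunkTransformer()
    logits = model(torch.randn(2, 8, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 400])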
Pages: 13