Shifted Chunk Transformer for Spatio-Temporal Representational Learning

Times Cited: 0
Authors
Zha, Xuefan [1 ]
Zhu, Wentao [1 ]
Lv, Tingxun [1 ]
Yang, Sen [1 ]
Liu, Ji [1 ]
Affiliations
[1] Kuaishou Technology, Beijing, People's Republic of China
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021) | 2021 / Vol. 34
DOI
None available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Spatio-temporal representational learning has been widely adopted in various fields such as action recognition, video object segmentation, and action anticipation. Previous spatio-temporal representational learning approaches primarily employ ConvNets or sequential models, e.g., LSTM, to learn the intra-frame and inter-frame features. Recently, Transformer models have successfully dominated the study of natural language processing (NLP), image classification, etc. However, pure-Transformer based spatio-temporal learning can be prohibitively costly in memory and computation when extracting fine-grained features from a tiny patch. To tackle the training difficulty and enhance spatio-temporal learning, we construct a shifted chunk Transformer with pure self-attention blocks. Leveraging recent efficient Transformer designs in NLP, this shifted chunk Transformer can learn hierarchical spatio-temporal features from a local tiny patch up to a global video clip. Our shifted self-attention can also effectively model complicated inter-frame variances. Furthermore, we build a clip encoder based on Transformer to model long-term temporal dependencies. We conduct thorough ablation studies to validate each component and hyper-parameter in our shifted chunk Transformer, and it outperforms previous state-of-the-art approaches on Kinetics-400, Kinetics-600, UCF101, and HMDB51.
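The abstract describes restricting self-attention to local chunks of tokens and shifting those chunks so information can flow between them. As an illustration only, here is a minimal NumPy sketch of chunk-local scaled dot-product attention with a cyclic shift (a Swin-style shifting trick used here for concreteness; the function name `chunk_attention` and the exact shifting scheme are assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def chunk_attention(tokens, chunk, shift=0):
    """Self-attention restricted to local chunks of a token sequence.

    tokens: (T, D) array of T tokens with dimension D.
    chunk:  number of tokens attended to jointly (must divide T here).
    shift:  cyclic offset applied before chunking and undone afterwards,
            so that successive layers mix tokens across chunk borders.
    """
    T, D = tokens.shape
    x = np.roll(tokens, -shift, axis=0)        # apply cyclic shift
    out = np.empty_like(x)
    for s in range(0, T, chunk):               # attend within each chunk only
        c = x[s:s + chunk]
        attn = softmax(c @ c.T / np.sqrt(D))   # scaled dot-product attention
        out[s:s + chunk] = attn @ c
    return np.roll(out, shift, axis=0)         # undo the shift
```

Alternating `shift=0` and `shift=chunk // 2` across layers lets every token eventually attend to the whole sequence while each layer stays O(T·chunk·D) instead of O(T²·D), which is the cost saving the abstract alludes to.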
Pages: 13