Self-Supervised Learning via Multi-Transformation Classification for Action Recognition

Cited by: 1
Authors
Duc-Quang Vu [1]
Ngan Le [2]
Wang, Jia-Ching [3]
Affiliations
[1] Thai Nguyen Univ Educ, Dept CSIS, Thai Nguyen, Vietnam
[2] Univ Arkansas, Dept CSCE, Fayetteville, AR 72701 USA
[3] Natl Cent Univ, Dept CSIE, Taoyuan, Taiwan
Source
2024 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO WORKSHOPS, ICMEW 2024 | 2024
Keywords
Self-supervised learning; Action Recognition; 3D ResNet; C3D; multi-transformation;
DOI
10.1109/ICMEW63481.2024.10645477
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Self-supervised tasks have been used to build useful representations for downstream tasks when annotations are unavailable. In this paper, we introduce a self-supervised video representation learning method based on multi-transformation classification to efficiently classify human actions. Self-supervised learning on various transformations not only provides richer contextual information but also makes the visual representation more robust to those transformations. The spatio-temporal representation of the video is learned in a self-supervised manner by classifying seven different transformations, i.e., rotation, clip inversion, permutation, split, join transformation, color switch, frame replacement, and noise addition. First, seven different video transformations are applied to video clips. Then 3D convolutional neural networks are used to extract features from the clips, and these features are processed to classify the pseudo-labels. We use the models learned in the pretext tasks as pre-trained models and fine-tune them to recognize human actions in the downstream task. We conducted experiments on the UCF101 and HMDB51 datasets with C3D and 3D ResNet-18 as backbone networks. The experimental results show that our proposed framework outperforms other state-of-the-art self-supervised action recognition approaches.
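The pretext step described in the abstract (apply a transformation to a clip, then use the transformation's identity as a pseudo-label) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not define the transformations precisely, so the four shown here (a 90-degree rotation, temporal reversal, frame shuffling, Gaussian noise), the label ordering, the noise scale, and the names `TRANSFORMS`, `apply_transform`, and `make_pretext_batch` are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical subset of the transformations named in the abstract.
# The index of each name serves as its pseudo-label (ordering is an assumption).
TRANSFORMS = ("rotation", "clip_inversion", "permutation", "noise_addition")


def apply_transform(clip, label, rng):
    """Return a transformed copy of `clip`, an array of shape (T, H, W, C)."""
    name = TRANSFORMS[label]
    if name == "rotation":
        # Rotate every frame by 90 degrees in the spatial (H, W) plane.
        return np.rot90(clip, k=1, axes=(1, 2)).copy()
    if name == "clip_inversion":
        # Reverse the temporal order of the frames.
        return clip[::-1].copy()
    if name == "permutation":
        # Shuffle the frame order (assumed meaning of "permutation").
        return clip[rng.permutation(clip.shape[0])]
    if name == "noise_addition":
        # Add Gaussian noise; the standard deviation of 10 is an arbitrary choice.
        noisy = clip.astype(np.float32) + rng.normal(0.0, 10.0, size=clip.shape)
        return np.clip(noisy, 0, 255).astype(clip.dtype)
    raise ValueError(f"unknown transformation: {name}")


def make_pretext_batch(clips, rng):
    """Pick one random transformation per clip; return (transformed clips, pseudo-labels)."""
    labels = rng.integers(0, len(TRANSFORMS), size=len(clips))
    transformed = [apply_transform(c, int(l), rng) for c, l in zip(clips, labels)]
    return transformed, labels
```

In a full pipeline, the transformed clips would be fed to a 3D CNN backbone such as C3D or 3D ResNet-18 and trained with a classification loss on the pseudo-labels, after which the backbone is fine-tuned for action recognition.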
Pages: 6