Semi-Supervised Action Recognition From Temporal Augmentation Using Curriculum Learning

Cited by: 20
Authors
Tong, Anyang [1]
Tang, Chao [1]
Wang, Wenjian [2]
Affiliations
[1] Hefei Univ, Sch Artificial Intelligence & Big Data, Hefei 230601, Peoples R China
[2] Shanxi Univ, Sch Comp & Informat Technol, Key Lab Comp Intelligence & Chinese Informat Proc, Minist Educ, Taiyuan 030006, Peoples R China
Keywords
Training; Labeling; Data models; Feature extraction; Noise measurement; Image recognition; Image classification; Action recognition; curriculum learning; semi-supervised learning; temporal augmentation; NETWORKS;
DOI
10.1109/TCSVT.2022.3210271
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology and Communication Technology];
Discipline Classification Codes
0808; 0809;
Abstract
Semi-supervised learning for video action recognition is a challenging research area. Existing state-of-the-art methods augment the temporal dimension of actions and combine this augmentation with FixMatch, the mainstream consistency-based semi-supervised learning framework. However, these approaches have two limitations: (1) clip-based data augmentation lacks coarse-grained and fine-grained temporal representations of actions, so models struggle to recognize synonymous expressions of the same action across different motion phases; (2) pseudo-label selection based on a constant threshold offers no "make-up curriculum" for difficult actions, which leaves the unlabeled data for those actions underused. To address these shortcomings, we propose semi-supervised action recognition via temporal augmentation using curriculum learning (TACL). Compared with previous work, TACL explores different temporal representations of the same action semantics in video and uses curriculum learning (CL) to ease model training. First, to capture action expressions that differ in form but share semantics, we design temporal action augmentation (TAA), which produces coarse-grained and fine-grained action expressions using constant-velocity and hetero-velocity sampling, respectively. Second, we construct a temporal signal that constrains the model to produce the same prediction for fine-grained action expressions spanning different movement phases, and we achieve action consistency learning (ACL) by combining label and pseudo-label signals. Finally, we propose action curriculum pseudo labeling (ACPL), a dynamic threshold evaluation algorithm that applies loose and strict thresholds in parallel to select and label unlabeled data. We evaluate TACL on three standard public datasets: UCF101, HMDB51, and Kinetics. Extensive experiments show that TACL significantly improves the accuracy of models trained on a small amount of labeled data and better assesses the learning progress of different actions.
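The abstract describes temporal action augmentation (TAA) only at a high level. As a rough, non-authoritative sketch of what constant-velocity versus hetero-velocity frame sampling could look like, consider the Python fragment below; the function names, the stride distribution, and all parameters are illustrative assumptions rather than the authors' implementation.

import numpy as np

def constant_velocity_indices(num_frames: int, clip_len: int) -> np.ndarray:
    # Coarse-grained view: frames sampled at a fixed stride across the video.
    return np.linspace(0, num_frames - 1, clip_len).astype(int)

def hetero_velocity_indices(num_frames: int, clip_len: int,
                            rng: np.random.Generator) -> np.ndarray:
    # Fine-grained view: strides vary at random, so the clip dwells on some
    # motion phases and skims over others while still spanning the video.
    strides = rng.random(clip_len - 1) + 0.5       # strictly positive strides
    positions = np.concatenate(([0.0], np.cumsum(strides)))
    positions *= (num_frames - 1) / positions[-1]  # normalize to video length
    return positions.astype(int)

# Illustrative usage: both index sets are treated as synonymous views of the
# same action, and a consistency loss pulls their predictions together.
rng = np.random.default_rng(0)
coarse = constant_velocity_indices(num_frames=120, clip_len=16)
fine = hetero_velocity_indices(num_frames=120, clip_len=16, rng=rng)

Under this reading, the consistency objective (ACL) would compare the model's predictions on the coarse and fine clips, in the spirit of FixMatch's weak/strong augmentation pairing.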
Pages: 1305-1319
Number of pages: 15