Cross-View Temporal Contrastive Learning for Self-Supervised Video Representation

Cited by: 0
Authors
Wang, Lulu [1 ,2 ]
Xu, Zengmin [1 ,2 ,3 ]
Zhang, Xuelian [1 ,2 ]
Meng, Ruxing [3 ]
Lu, Tao [4 ]
Affiliations
[1] Guangxi Colleges and Universities Key Laboratory of Data Analysis and Computation, School of Mathematics and Computing Science, Guilin University of Electronic Technology, Guilin, Guangxi
[2] Center for Applied Mathematics of Guangxi (GUET), Guilin, Guangxi
[3] Anview.ai, Guilin, Guangxi
[4] Hubei Key Laboratory of Intelligent Robot, School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan
Keywords
cross-view co-training; local contrastive learning; self-supervised learning; temporal contrastive learning; video representation learning
DOI
10.3778/j.issn.1002-8331.2312-0033
Abstract
Existing self-supervised representation learning algorithms focus mainly on short-term motion characteristics between video frames, yet the variation of action sequences across frames is small, and the semantic limitations of single-view data constrain the expressive power of deep features, so the rich multi-view information in video actions is not fully exploited. To address this, a temporal contrastive learning algorithm based on cross-view semantic consistency is proposed to learn, in a self-supervised manner, the temporal action dynamics embedded in both RGB frames and optical flow fields. The main ideas are as follows: a local temporal contrastive learning method is designed that adopts different positive and negative sample division strategies to exploit the temporal correlation and discriminability between non-overlapping segments of the same instance, enhancing fine-grained feature expression; a global contrastive learning method is studied that increases the number of positive samples through cross-view semantic co-training, learning the semantic consistency of different views across multiple instances and improving the generalization ability of the model. Model performance is evaluated on two downstream tasks, and experimental results on the UCF101 and HMDB51 datasets show that the proposed method outperforms state-of-the-art mainstream methods by 2 to 3.5 percentage points on average on action recognition and video retrieval. © 2024 Journal of Computer Engineering and Applications Beijing Co., Ltd.; Science Press. All rights reserved.
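To make the two objectives described in the abstract concrete, the sketch below gives a minimal PyTorch implementation of an InfoNCE-style loss that combines a local within-instance temporal term with a global cross-view (RGB/optical-flow) term. The function names, tensor shapes, and the specific positive/negative division strategy are assumptions introduced for illustration only; they do not reproduce the paper's actual formulation.

```python
# Minimal, illustrative sketch -- the positive/negative division below is an
# assumption for demonstration and is not the paper's exact method.
import torch
import torch.nn.functional as F


def info_nce(query, positives, negatives, temperature=0.07):
    """Generic InfoNCE loss for one query against positive and negative keys."""
    query = F.normalize(query, dim=-1)          # (1, D)
    positives = F.normalize(positives, dim=-1)  # (P, D)
    negatives = F.normalize(negatives, dim=-1)  # (N, D)
    pos_logits = query @ positives.t() / temperature  # (1, P)
    neg_logits = query @ negatives.t() / temperature  # (1, N)
    logits = torch.cat([pos_logits, neg_logits], dim=1)
    log_prob = F.log_softmax(logits, dim=1)
    # Average the log-likelihood over all positive slots.
    return -log_prob[:, : positives.size(0)].mean()


def cross_view_temporal_loss(rgb_clips, flow_clips, temperature=0.07):
    """
    rgb_clips, flow_clips: (B, S, D) embeddings of S >= 2 non-overlapping clips
    per video from the RGB and optical-flow encoders (hypothetical shapes).
    """
    B, S, D = rgb_clips.shape
    total = rgb_clips.new_zeros(())
    for b in range(B):
        other = torch.arange(B, device=rgb_clips.device) != b
        for s in range(S):
            q = rgb_clips[b, s : s + 1]  # query clip embedding, (1, D)
            # Local temporal term: other clips of the same video act as
            # positives, clips from other videos act as negatives.
            pos_local = torch.cat([rgb_clips[b, :s], rgb_clips[b, s + 1 :]])
            neg_local = rgb_clips[other].reshape(-1, D)
            total = total + info_nce(q, pos_local, neg_local, temperature)
            # Global cross-view term: the optical-flow embeddings of the same
            # video serve as extra positives (cross-view semantic consistency).
            pos_cross = flow_clips[b]
            neg_cross = flow_clips[other].reshape(-1, D)
            total = total + info_nce(q, pos_cross, neg_cross, temperature)
    return total / (B * S)
```

In practice such a loss would be computed on clip embeddings produced by the RGB and optical-flow backbones; vectorizing the loops over the batch and segment dimensions is straightforward but omitted here for readability.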
Pages: 158-166
Number of pages: 8