38 entries in total
[1]
WANG L, XIONG Y, WANG Z, Et al., Temporal segment networks: towards good practices for deep action recognition, Proceedings of the European Conference on Computer Vision, pp. 20-36, (2016)
[2]
TRAN D, BOURDEV L, FERGUS R, Et al., Learning spatiotemporal features with 3D convolutional networks, Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497, (2015)
[3]
ABU-EL-HAIJA S, KOTHARI N, LEE J, Et al., YouTube-8M: a large-scale video classification benchmark
[4]
KAY W, CARREIRA J, SIMONYAN K, Et al., The kinetics human action video dataset
[5]
HE K, FAN H, WU Y, Et al., Momentum contrast for unsupervised visual representation learning, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729-9738, (2020)
[6]
CHEN T, KORNBLITH S, NOROUZI M, Et al., A simple framework for contrastive learning of visual representations, Proceedings of the International Conference on Machine Learning, pp. 1597-1607, (2020)
[7]
GRILL J B, STRUB F, ALTCHÉ F, Et al., Bootstrap your own latent: a new approach to self-supervised learning, Advances in Neural Information Processing Systems, pp. 21271-21284, (2020)
[8]
CARON M, MISRA I, MAIRAL J, Et al., Unsupervised learning of visual features by contrasting cluster assignments, Advances in Neural Information Processing Systems, pp. 9912-9924, (2020)
[9]
CHEN X, HE K., Exploring simple siamese representation learning, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750-15758, (2021)
[10]
HAN T, XIE W, ZISSERMAN A., Self-supervised co-training for video representation learning, Advances in Neural Information Processing Systems, pp. 5679-5690, (2020)