Few-Shot Action Recognition with Hierarchical Matching and Contrastive Learning

Cited by: 39
Authors
Zheng, Sipeng [1]
Chen, Shizhe [2]
Jin, Qin [1]
Affiliations
[1] Renmin University of China, Beijing, People's Republic of China
[2] INRIA, Paris, France
Source
COMPUTER VISION - ECCV 2022, PT IV | 2022 / Vol. 13664
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China;
Keywords
Few-shot learning; Action recognition; Contrastive learning;
DOI
10.1007/978-3-031-19772-7_18
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Few-shot action recognition aims to recognize actions in test videos based on limited annotated data of target action classes. The dominant approaches project videos into a metric space and classify them via nearest-neighbor search. They mainly measure video similarity using global or temporal alignment alone, whereas an optimal matching should be multi-level. However, the complexity of learning coarse-to-fine matching rises quickly as the focus shifts to finer-grained visual cues, and the lack of detailed local supervision is another challenge. In this work, we propose a hierarchical matching model that supports comprehensive similarity measurement at the global, temporal, and spatial levels via a zoom-in matching module. We further propose mixed-supervised hierarchical contrastive learning (HCL), which not only employs supervised contrastive learning to differentiate videos at different levels, but also uses cycle consistency as weak supervision to align discriminative temporal clips or spatial patches. Our model achieves state-of-the-art performance on four benchmarks, especially under the most challenging 1-shot recognition setting.
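The metric-based matching that the abstract describes can be illustrated with a minimal sketch. This is not the authors' model: it assumes each video is already encoded as a (clips x dimensions) feature array, uses plain cosine similarity, and combines only a global score and a simple temporal score with a hypothetical `weight` parameter. The zoom-in module, the spatial level, and the contrastive training are omitted, and all function names are illustrative.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def global_similarity(query, support):
    # Global level: compare the mean-pooled clip features of two videos.
    q = l2_normalize(query.mean(axis=0))
    s = l2_normalize(support.mean(axis=0))
    return float(q @ s)

def temporal_similarity(query, support):
    # Temporal level (assumed form): match each query clip to its most
    # similar support clip, then average the best-match scores.
    sim = l2_normalize(query) @ l2_normalize(support).T  # shape (Tq, Ts)
    return float(sim.max(axis=1).mean())

def classify(query, support_videos, support_labels, weight=0.5):
    # Nearest-neighbor classification over a weighted sum of the two levels.
    scores = [
        weight * global_similarity(query, s)
        + (1.0 - weight) * temporal_similarity(query, s)
        for s in support_videos
    ]
    return support_labels[int(np.argmax(scores))], scores

# 1-shot, 2-way toy episode with hand-built features (4 clips, 8 dimensions).
support_a = np.eye(4, 8)        # clips lie along dimensions 0..3
support_b = np.eye(4, 8, k=4)   # clips lie along dimensions 4..7
query = support_a[::-1].copy()  # same clips as class "a", in reversed order

label, scores = classify(query, [support_a, support_b], ["a", "b"])
```

In this toy episode the query is assigned to class "a", because both the pooled features and the best-matching clips coincide with `support_a` while sharing no dimensions with `support_b`. The best-match temporal score ignores clip order, which is one reason a real system would need a more careful alignment than this sketch provides.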
Pages: 297-313
Number of pages: 17