Few-Shot Action Recognition via Multi-View Representation Learning

Times Cited: 5
Authors
Wang, Xiao [1 ]
Lu, Yang [1 ]
Yu, Wanchuan [1 ]
Pang, Yanwei [2 ,3 ]
Wang, Hanzi [1 ,3 ]
Affiliations
[1] Xiamen Univ, Fujian Key Lab Sensing & Comp Smart City, Sch Informat, Xiamen 361005, Peoples R China
[2] Tianjin Univ, Sch Elect & Informat Engn, Tianjin Key Lab Brain Inspired Intelligence Techn, Tianjin 300072, Peoples R China
[3] Shanghai Artificial Intelligence Lab, Shanghai 200232, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Few-shot learning; action recognition; meta-learning; multi-view representation learning;
DOI
10.1109/TCSVT.2024.3384875
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Classification Codes
0808; 0809;
Abstract
Few-shot action recognition aims to recognize novel action classes with limited labeled samples and has recently received increasing attention. The core objective of few-shot action recognition is to enhance the discriminability of feature representations. In this paper, we propose a novel multi-view representation learning network (MRLN) to model intra-video and inter-video relations for few-shot action recognition. Specifically, we first propose a spatial-aware aggregation refinement module (SARM), which mainly consists of a spatial-aware aggregation sub-module and a spatial-aware refinement sub-module, to explore the spatial context of samples at the frame level. Second, we design a temporal-channel enhancement module (TCEM), which captures the temporal-aware and channel-aware features of samples with elaborately designed temporal-aware enhancement and channel-aware enhancement sub-modules. Third, we introduce a cross-video relation module (CVRM), which explores the relations across videos by utilizing the self-attention mechanism. Moreover, we design a prototype-centered mean absolute error loss to improve the feature learning capability of the proposed MRLN. Extensive experiments on four prevalent few-shot action recognition benchmarks show that the proposed MRLN significantly outperforms a variety of state-of-the-art few-shot action recognition methods. In particular, under the 5-way 1-shot setting, our MRLN achieves 75.7%, 86.9%, 65.5% and 45.9% on the Kinetics, UCF101, HMDB51 and SSv2 datasets, respectively.
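The abstract names a prototype-centered mean absolute error loss but gives no formulation. The snippet below is a minimal PyTorch sketch of one plausible interpretation, assuming episodic training with class prototypes computed as support-set means; the function name prototype_centered_mae_loss, the equal weighting of the two terms, and the Euclidean-distance logits are illustrative assumptions, not the paper's actual method.

```python
import torch
import torch.nn.functional as F


def prototype_centered_mae_loss(support_feats, support_labels, query_feats, query_labels):
    # Hypothetical sketch: names, weighting, and the distance metric are assumptions.
    # support_feats: (N_support, D) pooled video features; labels are in [0, n_way).
    n_way = int(support_labels.max().item()) + 1

    # Class prototypes: per-class mean of the support features.
    prototypes = torch.stack(
        [support_feats[support_labels == c].mean(dim=0) for c in range(n_way)]
    )  # shape (n_way, D)

    # Prototype-centered term: L1 (mean absolute error) between each query
    # feature and the prototype of its ground-truth class.
    centered_mae = F.l1_loss(query_feats, prototypes[query_labels])

    # Standard prototypical classification over negative Euclidean distances.
    logits = -torch.cdist(query_feats, prototypes)  # shape (N_query, n_way)
    cls_loss = F.cross_entropy(logits, query_labels)

    return cls_loss + centered_mae
```

In the paper this loss is presumably balanced against the episodic classification objective with a tuned weight; the unweighted sum above is purely for illustration.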
Pages: 8522-8535
Number of Pages: 14