UPL-Net: Uncertainty-aware prompt learning network for semi-supervised action recognition

Times Cited: 0
Authors
Yang, Shu [1]
Li, Ya-Li [1]
Wang, Shengjin [1]
Affiliations
[1] Tsinghua Univ, Dept Elect Engn, Beijing 100084, Peoples R China
Keywords
Semi-supervised learning; Prompt learning; Vision-language pre-training; Action recognition; Uncertainty estimation;
DOI
10.1016/j.neucom.2024.129126
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
This paper focuses on understanding human behavior in videos by reframing the traditional video classification task as a transfer learning problem centered on visual concepts. Unlike existing action recognition approaches that rely solely on single-modal representations and video classifiers, our method leverages an uncertainty-aware prompt learning network (UPL-Net). This network is designed to extract spatiotemporal features that are pertinent to action-related concepts in videos while ensuring that the visual concepts derived from images are preserved. Furthermore, we introduce an uncertainty-guided semi-supervised learning strategy that harnesses unlabeled videos to enhance the model's generalizability. Extensive experiments conducted on benchmark datasets, namely UCF and HMDB, demonstrate the superiority of our approach over state-of-the-art semi-supervised action recognition methods. Notably, under a 1% labeling rate on the UCF dataset, our method achieves a significant improvement of 12.8%, underscoring its effectiveness in leveraging limited labeled data and abundant unlabeled videos for improved performance.
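The record gives no implementation details, so the following is only a minimal, self-contained sketch of the two ingredients the abstract names: learnable prompt vectors on top of a frozen CLIP-style video/text encoder, and an uncertainty-gated pseudo-label loss on unlabeled clips. Every name, shape, and hyperparameter below (PromptedClassifier, n_ctx, the Monte-Carlo-dropout gating rule, the 0.05 threshold) is an illustrative assumption, not the authors' released code.

# Hedged sketch: prompt learning + uncertainty-gated pseudo-labeling (assumed setup).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptedClassifier(nn.Module):
    """Learnable prompt vectors shared across frozen class-name embeddings (CoOp-style)."""

    def __init__(self, class_embeds: torch.Tensor, n_ctx: int = 8):
        super().__init__()
        dim = class_embeds.shape[1]
        self.register_buffer("class_embeds", class_embeds)       # frozen text features, [C, D]
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # learnable context vectors

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # Simplification: fold the mean context vector into every class prototype.
        text_feats = F.normalize(self.class_embeds + self.ctx.mean(dim=0), dim=-1)
        video_feats = F.normalize(video_feats, dim=-1)
        return 100.0 * video_feats @ text_feats.t()               # cosine-similarity logits


def uncertainty_gated_loss(logits_weak, logits_strong, threshold=0.05, n_samples=8, p_drop=0.1):
    """Pseudo-label loss on unlabeled clips, kept only where the prediction is stable.

    Uncertainty is approximated here with Monte-Carlo dropout over the weak-view logits;
    the gating rule (std of the winning class < threshold) is an assumed stand-in for
    whatever estimator the paper actually uses.
    """
    with torch.no_grad():
        probs = torch.stack([
            F.softmax(F.dropout(logits_weak, p=p_drop, training=True), dim=-1)
            for _ in range(n_samples)
        ])                                                        # [n_samples, B, C]
        pseudo = probs.mean(dim=0).argmax(dim=-1)                 # hard pseudo-labels, [B]
        uncertainty = probs.std(dim=0).gather(-1, pseudo.unsqueeze(-1)).squeeze(-1)
        mask = (uncertainty < threshold).float()                  # keep only low-uncertainty clips
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (mask * loss).sum() / mask.sum().clamp(min=1.0)


if __name__ == "__main__":
    torch.manual_seed(0)
    class_embeds = torch.randn(101, 512)   # e.g. 101 action classes, 512-d text features
    head = PromptedClassifier(class_embeds)
    weak, strong = torch.randn(4, 512), torch.randn(4, 512)  # two augmented views of 4 clips
    print(uncertainty_gated_loss(head(weak), head(strong)))

In a full semi-supervised pipeline, the weak and strong features would presumably come from two augmentations of the same unlabeled clip, and this gated loss would be added to the standard supervised cross-entropy on the labeled subset.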
Pages: 11