EVA: Enabling Video Attributes With Hierarchical Prompt Tuning for Action Recognition

Cited by: 0
Authors
Ruan, Xiangning [1 ]
Yin, Qixiang [1 ]
Su, Fei [1 ]
Zhao, Zhicheng [1 ]
Affiliations
[1] Beijing University of Posts and Telecommunications, School of Artificial Intelligence, Beijing 100876, People's Republic of China
Keywords
Feature extraction; Transformers; Visualization; Tuning; Adaptation models; Streaming media; Semantics; Computational modeling; Accuracy; Dictionaries; Parameter-efficient transfer learning; prompt-based learning; action recognition; transformer
DOI
10.1109/LSP.2025.3533307
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline Classification Codes
0808; 0809
Abstract
The pretraining and fine-tuning paradigm has excelled in action recognition. However, full fine-tuning is costly in computation and storage, while parameter-efficient fine-tuning (PEFT) often sacrifices accuracy and stability. To address these challenges, we propose a novel method, Enabling Video Attributes with Hierarchical Prompt Tuning (EVA), to guide action recognition. First, instead of focusing solely on temporal features, EVA sparsely extracts six types of video attributes across two modalities, capturing the relatively gradual attribute changes in actions. Second, a hierarchical prompt tuning architecture with multiscale attribute prompts is introduced to learn the differences between actions. Finally, by adjusting only a small number of additional parameters, EVA outperforms all PEFT methods and most full fine-tuning methods on four widely used datasets (Something-Something V2, ActivityNet, HMDB51, and UCF101), demonstrating its effectiveness.
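The record gives no implementation details, but to make the abstract's central idea concrete, below is a minimal, hypothetical PyTorch sketch of hierarchical (per-layer) prompt tuning on a frozen transformer backbone: only the injected prompt tokens and the classification head are trained. The class name, the stand-in backbone, and all hyperparameters are assumptions for illustration; this is not the authors' EVA implementation, which additionally derives multiscale prompts from six extracted video attributes.

```python
import torch
import torch.nn as nn

class HierarchicalPromptTuner(nn.Module):
    """Per-layer ("deep") prompt tuning over a frozen transformer encoder.

    Hypothetical sketch; NOT the authors' EVA code.
    """
    def __init__(self, num_layers=12, dim=768, heads=12,
                 prompts_per_layer=4, num_classes=174):  # 174 = SSv2 classes
        super().__init__()
        # Stand-in for a pretrained ViT encoder; in practice the pretrained
        # weights would be loaded here and kept frozen.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(num_layers)
        )
        for p in self.blocks.parameters():
            p.requires_grad = False          # backbone stays frozen (PEFT)
        # A small set of learnable prompt tokens injected at every layer;
        # these (plus the head) are the only trainable parameters.
        self.prompts = nn.ParameterList(
            nn.Parameter(torch.randn(prompts_per_layer, dim) * 0.02)
            for _ in range(num_layers)
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):               # tokens: (B, N, dim) embeddings
        n = tokens.shape[1]
        for block, prompt in zip(self.blocks, self.prompts):
            p = prompt.expand(tokens.shape[0], -1, -1)   # broadcast over batch
            # Prepend this layer's prompts, run the block, then drop them so
            # fresh prompts are injected at the next layer.
            tokens = block(torch.cat([p, tokens], dim=1))[:, -n:]
        return self.head(tokens.mean(dim=1)) # mean-pool -> action logits

# Example: 2 clips, 196 tokens each
logits = HierarchicalPromptTuner()(torch.randn(2, 196, 768))
```

Freezing the backbone keeps the trainable parameter count tiny (num_layers x prompts_per_layer x dim, plus the head), which is what lets prompt-tuning methods approach full fine-tuning at a fraction of the storage cost.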
Pages: 971-975
Page count: 5