MLKD-CLIP: Multi-layer Feature Knowledge Distillation of CLIP for Open-vocabulary Action Recognition

Cited by: 0
Authors
Jingjing Wang [1]
Junyong Ye [1]
Xinyuan Liu [1]
Youwei Li [1]
Guangyi Xu [1]
Chaoming Zheng [1]
Affiliations
[1] Chongqing University, Key Laboratory of Optoelectronic Technology and Systems of the Ministry of Education
Keywords
Video understanding; Action recognition; Knowledge distillation; Open-vocabulary; Multi-layer feature fusion
DOI
10.1007/s00530-025-01836-z
Abstract
Open-vocabulary action recognition aims to identify action categories unseen during training. Large-scale vision-language pre-trained models such as CLIP excel at open-vocabulary image tasks thanks to their strong generalization ability. However, CLIP lacks temporal modeling, and because video datasets are far smaller than CLIP's pre-training corpus, direct fine-tuning can erode CLIP's generalization and make unseen actions hard to recognize. To this end, we propose MLKD-CLIP, which uses a frozen CLIP as the teacher and a fine-tuned CLIP as the student to perform multi-layer feature knowledge distillation. First, we introduce a feature fusion module that merges features from different layers with self-attention and incorporates a temporal convolution module, so the model continues to learn temporal structure during distillation. Next, we apply layer-wise fusion to combine the multi-layer features of both the teacher and the student, allowing the model to balance their importance in the distillation process. Finally, we distill the fused features, so the student learns the teacher's multi-level features while capturing both global representations and local details. In addition, a classification task on video datasets further encourages the student to learn video features. We evaluate the open-vocabulary action recognition capability of MLKD-CLIP on the UCF101, HMDB51, and SSv2 datasets, where it achieves the best top-1 accuracy among popular methods. MLKD-CLIP offers a new perspective on open-vocabulary action recognition.
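To make the distillation recipe in the abstract concrete, the following minimal PyTorch sketch illustrates one plausible reading of it: a shared self-attention-plus-temporal-convolution fusion module applied to per-layer features, and a layer-weighted MSE loss matching the student's fused features to the frozen teacher's, combined with a video classification loss. All module shapes, the choice of MSE, and the loss weighting are our illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Fuse per-layer features with self-attention over frames, then mix
    information across time with a 1-D temporal convolution (assumed design)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, dim) -- one pooled token per frame.
        x, _ = self.attn(feats, feats, feats)        # self-attention fusion
        x = x.transpose(1, 2)                        # (batch, dim, frames)
        x = self.temporal_conv(x).transpose(1, 2)    # temporal modeling
        return x

def distillation_loss(student_layers, teacher_layers, fusion, layer_weights):
    """Layer-wise fused-feature distillation: fuse each layer's features for
    student and (frozen) teacher, then take a weighted MSE between them."""
    loss = 0.0
    for w, fs, ft in zip(layer_weights, student_layers, teacher_layers):
        loss = loss + w * F.mse_loss(fusion(fs), fusion(ft).detach())
    return loss

# Toy usage: 4 transformer layers, batch of 2 clips, 8 frames, 512-dim features.
fusion = FeatureFusion(dim=512)
student_feats = [torch.randn(2, 8, 512) for _ in range(4)]   # fine-tuned CLIP
teacher_feats = [torch.randn(2, 8, 512) for _ in range(4)]   # frozen CLIP
weights = torch.softmax(torch.ones(4), dim=0)  # layer weights; learnable in practice
kd = distillation_loss(student_feats, teacher_feats, fusion, weights)
logits = torch.randn(2, 101, requires_grad=True)  # video classifier output (e.g. UCF101)
labels = torch.randint(0, 101, (2,))
total = kd + F.cross_entropy(logits, labels)      # distillation + classification objective

In this reading, the classification term keeps the student adapting to video data while the fused-feature matching term anchors it to the frozen teacher's multi-level representations, which is how the abstract describes preserving CLIP's generalization during fine-tuning.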