Zero-Shot Action Recognition Based on CLIP Model and Knowledge Database

Cited by: 0
Authors
Yonghong, Hou [1 ]
Haochun, Zheng [2 ]
Jiajun, Gao [1 ]
Yi, Ren [3 ]
Affiliations
[1] School of Electrical and Information Engineering, Tianjin University, Tianjin
[2] School of Future Technology, Tianjin University, Tianjin
[3] Institute of Software, Chinese Academy of Sciences, Beijing
Source
Tianjin Daxue Xuebao (Ziran Kexue yu Gongcheng Jishu Ban)/Journal of Tianjin University Science and Technology | 2025, Vol. 58, No. 01
Funding
National Natural Science Foundation of China;
Keywords
action recognition; contrastive language-image pre-training(CLIP) model; knowledge database; zero-shot learning(ZSL);
DOI
10.11784/tdxbz202403012
CLC Number
Subject Classification Code
Abstract
Zero-shot action recognition (ZSAR) aims to learn knowledge from seen action classes and apply it to unseen action classes, thereby achieving recognition and classification of unknown action samples. However, existing ZSAR models are limited by the amount of training data, which restricts their ability to learn prior knowledge and to accurately map visual features to semantic labels. To address this issue, a ZSAR framework was proposed in this study by introducing an external knowledge database and using the contrastive language-image pre-training (CLIP) model. This framework utilized the knowledge acquired through self-supervised contrastive learning by the multimodal CLIP model to expand the prior knowledge available for ZSAR. Moreover, a temporal encoder was designed to compensate for the CLIP model's lack of temporal modeling capability. To enhance semantic features and bridge the gap between visual features and semantic labels, the semantic labels of seen action classes were extended: simple text labels were replaced with more detailed descriptive sentences to enrich the semantic information of text representations. On this basis, a knowledge database was constructed outside the model, which provided additional information without increasing the model's parameter scale and strengthened the association between visual and text features. Finally, following the ZSAR protocol, the model was fine-tuned for the ZSAR task to improve its generalization ability. The proposed method was evaluated extensively on two mainstream datasets, HMDB51 and UCF101. The experimental results demonstrate significant improvements of 3.8% and 2.3% on these two datasets, respectively, compared with previous methods, validating the effectiveness of the proposed approach. © 2025 Tianjin University. All rights reserved.
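The pipeline described in the abstract (frame features from an image-text encoder, a temporal encoder, class labels expanded into descriptive sentences, and an external knowledge database consulted at classification time) can be sketched schematically. The sketch below is a minimal toy illustration, not the paper's implementation: real CLIP features are replaced by random unit vectors, the learned temporal encoder by mean pooling, and the knowledge database by a small per-class array of sentence embeddings; `zero_shot_classify`, `alpha`, and all array shapes are hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy embedding dimension; real CLIP embeddings are 512-d or larger

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def temporal_encode(frame_features):
    # Stand-in temporal encoder: mean-pool per-frame features into one
    # clip-level embedding (the paper's encoder is a learned module).
    return l2_normalize(frame_features.mean(axis=0))

def zero_shot_classify(video_emb, class_text_embs, kb_entries=None, alpha=0.5):
    # Cosine similarity between the video embedding and each class's text
    # embedding (all vectors L2-normalized, so dot product = cosine).
    # If a per-class knowledge database is given, blend in the similarity
    # to the best-matching stored entry for each class.
    sims = class_text_embs @ video_emb
    if kb_entries is not None:
        kb_sims = np.array([(e @ video_emb).max() for e in kb_entries])
        sims = alpha * sims + (1.0 - alpha) * kb_sims
    return int(np.argmax(sims))

# Toy stand-ins for encoder outputs: 3 unseen classes, and 16 video frames
# whose features happen to lie near the class-1 text embedding.
class_text_embs = l2_normalize(rng.standard_normal((3, DIM)))
frames = class_text_embs[1] + 0.05 * rng.standard_normal((16, DIM))
video_emb = temporal_encode(frames)

# Knowledge database: a few descriptive-sentence embeddings per class,
# simulated here as perturbations of each class's label embedding.
kb_entries = [l2_normalize(t + 0.1 * rng.standard_normal((4, DIM)))
              for t in class_text_embs]

pred = zero_shot_classify(video_emb, class_text_embs, kb_entries)
```

Because the video embedding is constructed near class 1's text embedding, both the plain CLIP-style similarity and the knowledge-database-blended score select class 1; the blend illustrates how external entries can reinforce the visual-text association without adding model parameters.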
Pages: 91-100
Number of pages: 9
Related Papers
39 records in total
[1]  
Palatucci M, Pomerleau D, Hinton G E, et al., Zero-shot learning with semantic output codes[C], Neural Information Processing Systems, pp. 1063-6919, (2009)
[2]  
Tian Y, Kong Y, Ruan Q Q, et al., Aligned dynamic-preserving embedding for zero-shot action recognition[J], IEEE Transactions on Circuits and Systems for Video Technology, 30, 6, pp. 1597-1612, (2020)
[3]  
Ji Zhong, Guo Weichen, Zero-shot action recognition based on locality preserving canonical correlation analysis[J], Journal of Tianjin University (Science and Technology), 50, 9, pp. 975-983, (2017)
[4]  
Liu J G, Kuipers B, Savarese S., Recognizing human actions by attributes[C], The 24th IEEE Conference on Computer Vision and Pattern Recognition, pp. 3337-3344, (2011)
[5]  
Kodirov E, Xiang T, et al., Unsupervised domain adaptation for zero-shot learning[C], 2015 IEEE International Conference on Computer Vision, pp. 2452-2460, (2015)
[6]  
Roitberg A, Al-Halah Z, Stiefelhagen R., Informed democracy: Voting-based novelty detection for action recognition
[7]  
Brattoli B, Tighe J, Zhdanov F, et al., Rethinking zero-shot video classification: End-to-end training for realistic applications[C], 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4612-4622, (2020)
[8]  
Mandal D, Narayan S, Dwivedi S K, et al., Out-of-distribution detection for generalized zero-shot action recognition[C], 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9977-9985, (2019)
[9]  
Xu X, Hospedales T, Gong S G., Semantic embedding space for zero-shot action recognition[C], 2015 IEEE International Conference on Image Processing (ICIP), pp. 63-67, (2015)
[10]  
Gan C, Lin M, Yang Y, et al., Concepts not alone: Exploring pairwise relationships for zero-shot video activity recognition[C], Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3487-3493, (2016)