Text-guided Graph Temporal Modeling for few-shot video classification

Cited by: 0
Authors
Deng, Fuqin [1 ,6 ,7 ]
Zhong, Jiaming [1 ,3 ]
Li, Nannan [2 ]
Fu, Lanhui [1 ]
Jiang, Bingchun [3 ]
Yi, Ningbo [5 ]
Qi, Feng [4 ]
Xin, He [4 ]
Lam, Tin Lun [7 ]
Affiliations
[1] Wuyi Univ, Sch Elect & Informat Engn, Jiangmen, Peoples R China
[2] Macau Univ Sci & Technol, Fac Innovat Engn, Sch Comp Sci & Engn, Macau, Peoples R China
[3] Guangdong Univ Sci & Technol, Sch Mech & Elect Engn, Dongguan, Peoples R China
[4] Wuyi Univ, Sch Appl Phys & Mat Sci, Jiangmen, Peoples R China
[5] Wuyi Univ, Sch Text Mat & Engn, Jiangmen, Peoples R China
[6] Shenzhen Vatop Semicon Tech Co Ltd, Shenzhen, Peoples R China
[7] Chinese Univ Hong Kong, Shenzhen Inst Artificial Intelligence & Robot Soc, Sch Sci & Engn, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Few-shot video classification; Multi-modal learning; Large model application; Graph Temporal Network;
DOI
10.1016/j.engappai.2024.109076
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Large-scale pre-trained models and graph neural networks have recently demonstrated remarkable success in few-shot video classification tasks. However, they generally suffer from two key limitations: i) the temporal relations between adjacent frames tend to be ambiguous due to the lack of explicit temporal modeling, and ii) the absence of multi-modal semantic knowledge in query videos leads to inaccurate prototype construction and prevents multi-modal temporal alignment metrics. To address these issues, we develop a Text-guided Graph Temporal Modeling (TgGTM) method that consists of two crucial components: a text-guided feature refinement module and a learnable Query text-token contrastive objective. Specifically, the former leverages a temporal masking layer to guide the model in learning temporal relationships between adjacent frames, and utilizes multi-modal information to refine video prototypes for comprehensive few-shot video classification. The latter addresses the feature discrepancy between multi-modal support features and single-modal query features by aligning a learnable Query text-token with the corresponding base-class text descriptions. Extensive experiments on four commonly used benchmarks demonstrate the effectiveness of the proposed method, which achieves mean accuracies of 54.4%, 80.3%, 91.9%, and 96.2% for 5-way 1-shot classification on SSV2-Small, HMDB51, Kinetics, and UCF101, respectively, surpassing existing state-of-the-art methods. A detailed ablation study highlights the importance of learning temporal relationships between adjacent frames and of obtaining the Query text-token. The source code and models will be publicly available at https://github.com/JiaMingZhong2621/TgGTM.
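To make the second component more concrete, below is a minimal, illustrative PyTorch sketch of how a learnable Query text-token could be contrastively aligned with base-class text embeddings. The fusion layer, dimensions, and loss form (an InfoNCE-style cross-entropy over class text features) are assumptions made for illustration and are not taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryTextTokenAlignment(nn.Module):
    # Hedged sketch: a learnable text-token stands in for the missing text
    # modality of query videos and is pulled toward the text embedding of the
    # query's ground-truth base class. Fusion and dimensions are illustrative.
    def __init__(self, embed_dim=512, temperature=0.07):
        super().__init__()
        self.query_text_token = nn.Parameter(0.02 * torch.randn(embed_dim))
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)  # fuse video feature with the token
        self.temperature = temperature

    def forward(self, query_video_feats, class_text_embeds, target_class):
        # query_video_feats: (B, D) pooled features of query videos
        # class_text_embeds: (C, D) text features of the base classes (e.g., from a frozen text encoder)
        # target_class:      (B,)   ground-truth base-class index of each query video
        token = self.query_text_token.expand(query_video_feats.size(0), -1)
        pseudo_text = self.fuse(torch.cat([query_video_feats, token], dim=-1))
        pseudo_text = F.normalize(pseudo_text, dim=-1)
        class_text = F.normalize(class_text_embeds, dim=-1)
        logits = pseudo_text @ class_text.t() / self.temperature  # (B, C) similarities
        # Cross-entropy pulls each fused query feature toward its class description.
        return F.cross_entropy(logits, target_class)

Under this reading, the objective would be added to the base-class training loss so that, at test time, the token supplies text-like semantics for query videos; whether TgGTM fuses the token exactly this way is not specified in the abstract.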
Pages: 12