Text-guided Graph Temporal Modeling for few-shot video classification

Cited by: 0
Authors
Deng, Fuqin [1 ,6 ,7 ]
Zhong, Jiaming [1 ,3 ]
Li, Nannan [2 ]
Fu, Lanhui [1 ]
Jiang, Bingchun [3 ]
Yi, Ningbo [5 ]
Qi, Feng [4 ]
Xin, He [4 ]
Lam, Tin Lun [7 ]
Affiliations
[1] Wuyi Univ, Sch Elect & Informat Engn, Jiangmen, Peoples R China
[2] Macau Univ Sci & Technol, Fac Innovat Engn, Sch Comp Sci & Engn, Macau, Peoples R China
[3] Guangdong Univ Sci & Technol, Sch Mech & Elect Engn, Dongguan, Peoples R China
[4] Wuyi Univ, Sch Appl Phys & Mat Sci, Jiangmen, Peoples R China
[5] Wuyi Univ, Sch Text Mat & Engn, Jiangmen, Peoples R China
[6] Shenzhen Vatop Semicon Tech Co Ltd, Shenzhen, Peoples R China
[7] Chinese Univ Hong Kong, Shenzhen Inst Artificial Intelligence & Robot Soc, Sch Sci & Engn, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Few-shot video classification; Multi-modal learning; Large model application; Graph Temporal Network;
DOI
10.1016/j.engappai.2024.109076
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Large-scale pre-trained models and graph neural networks have recently demonstrated remarkable success in few-shot video classification tasks. However, they generally suffer from two key limitations: i) the temporal relations between adjacent frames tend to be ambiguous due to the lack of explicit temporal modeling, and ii) the absence of multi-modal semantic knowledge in query videos leads to inaccurate prototype construction and prevents multi-modal temporal alignment metrics. To address these issues, we develop a Text-guided Graph Temporal Modeling (TgGTM) method that consists of two crucial components: a text-guided feature refinement module and a learnable Query text-token contrastive objective. Specifically, the former leverages a temporal masking layer to guide the model in learning temporal relationships between adjacent frames, and utilizes multi-modal information to refine video prototypes for comprehensive few-shot video classification. The latter addresses the feature discrepancy between multi-modal support features and single-modal query features by aligning a learnable Query text-token with the corresponding base-class text descriptions. Extensive experiments on four commonly used benchmarks demonstrate the effectiveness of the proposed method, which achieves mean accuracies of 54.4%, 80.3%, 91.9%, and 96.2% for 5-way 1-shot classification on SSV2-Small, HMDB51, Kinetics, and UCF101, respectively, surpassing existing state-of-the-art methods. A detailed ablation study highlights the importance of learning temporal relationships between adjacent frames and of obtaining the Query text-token. The source code and models will be publicly available at https://github.com/JiaMingZhong2621/TgGTM.
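To make the second component more concrete, below is a minimal, illustrative PyTorch sketch of how a learnable Query text-token could be contrastively aligned with base-class text embeddings. The fusion layer, dimensions, and loss form (an InfoNCE-style cross-entropy over class text features) are assumptions made for illustration and are not taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryTextTokenAlignment(nn.Module):
    # Hedged sketch: a learnable text-token stands in for the missing text
    # modality of query videos and is pulled toward the text embedding of the
    # query's ground-truth base class. Fusion and dimensions are illustrative.
    def __init__(self, embed_dim=512, temperature=0.07):
        super().__init__()
        self.query_text_token = nn.Parameter(0.02 * torch.randn(embed_dim))
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)  # fuse video feature with the token
        self.temperature = temperature

    def forward(self, query_video_feats, class_text_embeds, target_class):
        # query_video_feats: (B, D) pooled features of query videos
        # class_text_embeds: (C, D) text features of the base classes (e.g., from a frozen text encoder)
        # target_class:      (B,)   ground-truth base-class index of each query video
        token = self.query_text_token.expand(query_video_feats.size(0), -1)
        pseudo_text = self.fuse(torch.cat([query_video_feats, token], dim=-1))
        pseudo_text = F.normalize(pseudo_text, dim=-1)
        class_text = F.normalize(class_text_embeds, dim=-1)
        logits = pseudo_text @ class_text.t() / self.temperature  # (B, C) similarities
        # Cross-entropy pulls each fused query feature toward its class description.
        return F.cross_entropy(logits, target_class)

Under this reading, the objective would be added to the base-class training loss so that, at test time, the token supplies text-like semantics for query videos; whether TgGTM fuses the token exactly this way is not specified in the abstract.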
Pages: 12