Knowledge-guided pre-training and fine-tuning: Video representation learning for action recognition

Cited by: 3

Authors:

Wang, Guanhong [1,2]

Zhou, Yang [1]

He, Zhanhao [1]

Lu, Keyu [1]

Feng, Yang [3]

Liu, Zuozhu [1,2]

Wang, Gaoang [1,2]

Affiliations:

[1] Zhejiang Univ, Zhejiang Univ Univ Illinois Urbana Champaign Inst, Haining 314400, Peoples R China

[2] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310027, Peoples R China

[3] Angelalign Inc, Angelalign Res Inst, Shanghai 200011, Peoples R China

Source:

NEUROCOMPUTING | 2024 / Vol. 571

Funding:

National Natural Science Foundation of China; National Key Research and Development Program;

Keywords:

Video representation learning; Knowledge distillation; Action recognition; Video retrieval;

DOI:

10.1016/j.neucom.2023.127136

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video-based action recognition is an important task in the computer vision community. It aims to extract rich spatio-temporal information from videos in order to recognize human actions. Many approaches apply self-supervised learning to large-scale unlabeled datasets and then use transfer learning for the downstream action recognition task. Although video representation learning has brought much progress in action recognition, two main issues remain for most existing methods. First, pre-training with self-supervised pretext tasks usually learns neutral representations that are not very informative for the downstream action recognition task. Second, the valuable knowledge learned from large-scale pre-training datasets is gradually forgotten during the fine-tuning stage. To address these issues, we propose a novel video representation learning method with knowledge-guided pre-training and fine-tuning for action recognition. The method incorporates external human parsing knowledge to generate informative representations during pre-training, and it preserves the pre-trained knowledge during fine-tuning via self-distillation to avoid catastrophic forgetting. Our model, with contributions from the external human parsing knowledge, video-level contrastive learning, and knowledge-preserving self-distillation, achieves state-of-the-art performance on two popular benchmarks, UCF101 and HMDB51, verifying the effectiveness of the proposed method.
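The abstract does not give the form of the self-distillation objective, so the following is only a minimal sketch of the general idea: during fine-tuning, a classification loss is combined with a penalty that keeps the fine-tuned features close to those of a frozen copy of the pre-trained model. The function names, the mean-squared-error penalty, and the `weight` parameter are all assumptions made for illustration, not the authors' formulation.

```python
import math


def cross_entropy(class_logits, label):
    """Cross-entropy of one sample, computed with a numerically stable softmax."""
    m = max(class_logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in class_logits))
    return log_sum - class_logits[label]


def feature_preservation(student_features, teacher_features):
    """Mean squared distance between fine-tuned and frozen pre-trained features.

    Assumption: an MSE penalty stands in for the paper's distillation term.
    """
    diffs = [(s - t) ** 2 for s, t in zip(student_features, teacher_features)]
    return sum(diffs) / len(diffs)


def fine_tune_loss(class_logits, student_features, teacher_features, label, weight=0.5):
    """Task loss plus a weighted penalty for drifting from the pre-trained features.

    `weight` is a hypothetical trade-off hyperparameter.
    """
    task = cross_entropy(class_logits, label)
    preserve = feature_preservation(student_features, teacher_features)
    return task + weight * preserve
```

When the fine-tuned features match the frozen ones, the penalty vanishes and only the classification loss remains. Any drift from the pre-trained features increases the total loss, which is the mechanism by which a term of this kind discourages forgetting.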

Pages: 10
