Knowledge-guided pre-training and fine-tuning: Video representation learning for action recognition

被引:3
|
作者
Wang, Guanhong [1 ,2 ]
Zhou, Yang [1 ]
He, Zhanhao [1 ]
Lu, Keyu [1 ]
Feng, Yang [3 ]
Liu, Zuozhu [1 ,2 ]
Wang, Gaoang [1 ,2 ]
机构
[1] Zhejiang Univ, Zhejiang Univ Univ Illinois Urbana Champaign Inst, Haining 314400, Peoples R China
[2] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310027, Peoples R China
[3] Angelalign Inc, Angelalign Res Inst, Shanghai 200011, Peoples R China
基金
中国国家自然科学基金; 国家重点研发计划;
关键词
Video representation learning; Knowledge distillation; Action recognition; Video retrieval;
D O I
10.1016/j.neucom.2023.127136
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Video-based action recognition is an important task in the computer vision community, aiming to extract rich spatial-temporal information to recognize human actions from videos. Many approaches adopt self-supervised learning in large-scale unlabeled datasets and exploit transfer learning in the downstream action recognition task. Though much progress has been made for action recognition with video representation learning, two main issues remain for most existing methods. Firstly, the pre-training with self-supervised pretext tasks usually learns neutral and not much informative representations for the downstream action recognition task. Secondly, the valuable learned knowledge from large-scaled pre-training datasets will be gradually forgotten in the fine-tuning stage. To address such issues, in this paper, we propose a novel video representation learning method with knowledge-guided pre-training and fine-tuning for action recognition, which incorporates external human parsing knowledge for generating informative representation in the pre-training, and preserves the pre-trained knowledge in the fine-tuning stage to avoid catastrophic forgetting via self-distillation. Our model, with contributions from the external human parsing knowledge, video-level contrastive learning, and knowledge preserving self-distillation, achieves state-of-the-art performance on two popular benchmarks, i.e., UCF101 and HMDB51, verifying the effectiveness of the proposed method.
引用
收藏
页数:10
相关论文
共 15 条
  • [1] KNOWLEDGE DISTILLATION FROM BERT IN PRE-TRAINING AND FINE-TUNING FOR POLYPHONE DISAMBIGUATION
    Sun, Hao
    Tan, Xu
    Gan, Jun-Wei
    Zhao, Sheng
    Han, Dongxu
    Liu, Hongzhi
    Qin, Tao
    Liu, Tie-Yan
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 168 - 175
  • [2] Learning hierarchical video representation for action recognition
    Li Q.
    Qiu Z.
    Yao T.
    Mei T.
    Rui Y.
    Luo J.
    International Journal of Multimedia Information Retrieval, 2017, 6 (1) : 85 - 98
  • [3] Pre-training for Action Recognition with Automatically Generated Fractal Datasets
    Svyezhentsev, Davyd
    Retsinas, George
    Maragos, Petros
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025, : 4923 - 4943
  • [4] Joint Pre-training and Local Re-training: Transferable Representation Learning on Multi-source Knowledge Graphs
    Sun, Zequn
    Huang, Jiacheng
    Lin, Jinghao
    Xu, Xiaozhou
    Chen, Qijin
    Hu, Wei
    PROCEEDINGS OF THE 29TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2023, 2023, : 2132 - 2144
  • [5] Collaboratively Self-Supervised Video Representation Learning for Action Recognition
    Zhang, Jie
    Wan, Zhifan
    Hu, Lanqing
    Lin, Stephen
    Wu, Shuzhe
    Shan, Shiguang
    IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2025, 20 : 1895 - 1907
  • [6] Spatiotemporal Saliency Representation Learning for Video Action Recognition
    Kong, Yongqiang
    Wang, Yunhong
    Li, Annan
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 1515 - 1528
  • [7] Cross-Modal Contrastive Pre-Training for Few-Shot Skeleton Action Recognition
    Lu, Mingqi
    Yang, Siyuan
    Lu, Xiaobo
    Liu, Jun
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (10) : 9798 - 9807
  • [8] JOINT LEARNING ON THE HIERARCHY REPRESENTATION FOR FINE-GRAINED HUMAN ACTION RECOGNITION
    Leong, Mei Chee
    Tan, Hui Li
    Zhang, Haosong
    Li, Liyuan
    Lin, Feng
    Lim, Joo Hwee
    2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 1059 - 1063
  • [9] Dynamic Representation Learning for Video Action Recognition Using Temporal Residual Networks
    Kong, Yongqiang
    Huang, Jianhui
    Huang, Shanshan
    Wei, Zhengang
    Wang, Shengke
    2018 IEEE SMARTWORLD, UBIQUITOUS INTELLIGENCE & COMPUTING, ADVANCED & TRUSTED COMPUTING, SCALABLE COMPUTING & COMMUNICATIONS, CLOUD & BIG DATA COMPUTING, INTERNET OF PEOPLE AND SMART CITY INNOVATION (SMARTWORLD/SCALCOM/UIC/ATC/CBDCOM/IOP/SCI), 2018, : 331 - 337
  • [10] Multi-teacher knowledge distillation for compressed video action recognition based on deep learning
    Wu, Meng-Chieh
    Chiu, Ching-Te
    JOURNAL OF SYSTEMS ARCHITECTURE, 2020, 103