Collaboratively Self-Supervised Video Representation Learning for Action Recognition

Cited by: 0
Authors
Zhang, Jie [1 ,2 ]
Wan, Zhifan [1 ,2 ]
Hu, Lanqing [1 ,2 ]
Lin, Stephen [3 ]
Wu, Shuzhe [4 ]
Shan, Shiguang [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol ICT, Key Lab AI Safety CAS, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci UCAS, Sch Comp Sci & Technol, Beijing 100049, Peoples R China
[3] Microsoft Res Asia, Beijing 100080, Peoples R China
[4] Beijing Huawei Digital Technol Co Ltd, Beijing 100095, Peoples R China
Keywords
Representation learning; Dynamics; Generative adversarial networks; Feature extraction; Image reconstruction; Contrastive learning; Transformers; Training; Pose estimation; Generators; Video representation learning; self-supervised learning; action recognition; human pose prediction;
DOI
10.1109/TIFS.2025.3531772
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
Considering the close connection between action recognition and human pose estimation, we design a Collaboratively Self-supervised Video Representation (CSVR) learning framework specific to action recognition that jointly uses generative pose prediction and discriminative context matching as pretext tasks. Specifically, CSVR consists of three branches: a generative pose prediction branch, a discriminative context matching branch, and a video generation branch. The first branch encodes dynamic motion features by using a conditional GAN to predict the human poses of future frames. The second branch extracts static context features by contrasting positive and negative pairs of video features and I-frame features. The third branch generates both current and future video frames in order to collaboratively improve the dynamic motion features and the static context features. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple popular video datasets.
Pages: 1895-1907
Number of pages: 13
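The discriminative context matching branch in the abstract contrasts positive and negative pairs of video features and I-frame features. The abstract does not state which contrastive loss is used, so the following is only a minimal sketch that assumes an InfoNCE-style objective. The function name `info_nce_loss`, the `temperature` value, and the feature shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def info_nce_loss(video_feats, iframe_feats, temperature=0.1):
    """InfoNCE-style loss over a batch (assumed objective, not from the paper).

    Row i of video_feats and row i of iframe_feats form the positive pair;
    every other row of iframe_feats acts as a negative for video i.
    Both inputs have shape (batch, dim).
    """
    # L2-normalize so the dot product is cosine similarity.
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    f = iframe_feats / np.linalg.norm(iframe_feats, axis=1, keepdims=True)

    # Pairwise similarities scaled by the temperature, shape (batch, batch).
    logits = v @ f.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(8, 16))                   # hypothetical video features
    iframe = video + 0.01 * rng.normal(size=(8, 16))   # hypothetical I-frame features
    print(info_nce_loss(video, iframe))
```

In this sketch the loss is small when each video feature is most similar to its own I-frame feature and grows when the pairing is shuffled, which is the behavior a context matching objective of this kind is meant to reward.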