Self-supervised pretext task collaborative multi-view contrastive learning for video action recognition

Cited by: 0
Authors
Bi, Shuai [1 ]
Hu, Zhengping [1 ]
Zhao, Mengyao [1 ]
Zhang, Hehao [1 ]
Di, Jirui [1 ]
Sun, Zhe [1 ]
Affiliations
[1] Yanshan Univ, Sch Informat Sci & Engn, West Hebei St 438, Qinhuangdao 066004, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Unsupervised learning; Self-supervised learning; Pretext task learning; Multi-view contrastive learning; Action recognition;
DOI
10.1007/s11760-023-02605-z
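Chinese Library Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline classification codes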
0808 ; 0809 ;
Abstract
Self-supervised video representation learning attempts to extract latent spatiotemporal semantic information from unlabeled data for use in downstream visual understanding tasks. However, we found that in mainstream video datasets, the same action may be labeled with inconsistent categories in different environments. Therefore, it is crucial to attend to both motion features and background areas while extracting the spatial and temporal characteristics of a video. This paper presents a self-supervised action recognition framework that learns the dynamic-static features of video by combining a pretext task with cross-view contrastive learning. Specifically, we first introduce a video cloze procedure pretext task that exploits strong temporal correlations to obtain prediction categories, which are then used to generate supervisory information. Next, multi-view contrastive learning is proposed to extract motion characteristics and global semantic information from consecutive video frames. Through joint optimization of the pretext task and multiple contrastive losses, our method achieves recognition accuracy on the UCF101 and HMDB51 datasets that is 1.2% and 0.8% higher than the highest accuracy obtained with residual contrastive learning and 1.3% and 0.4% higher than that obtained with RGB contrastive learning only. Experimental results with different datasets and backbone networks demonstrate that our proposal can significantly increase the generalization and robustness of the model.
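The abstract's idea of jointly optimizing a pretext loss with several cross-view contrastive losses can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the use of frame differences as the residual view, the InfoNCE form of the contrastive loss, the temperature value, and the weighted-sum combination are all assumptions chosen for illustration.

```python
import numpy as np

def frame_residuals(clip):
    """Residual view (assumed): difference of consecutive frames, (T, H, W, C) -> (T-1, H, W, C)."""
    return clip[1:] - clip[:-1]

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE loss between two views; row i of z_a and row i of z_b are the positive pair."""
    # L2-normalize so the dot product is cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal; all other columns act as negatives.
    return float(-np.mean(np.diag(log_prob)))

def joint_loss(pretext_loss, contrastive_losses, weight=1.0):
    """Hypothetical joint objective: pretext loss plus a weighted sum of contrastive losses."""
    return pretext_loss + weight * sum(contrastive_losses)
```

In such a setup, embeddings of the RGB view and the residual view of the same clip would be passed to `info_nce` as the positive pairs, and the resulting losses would be combined with the pretext-task loss through `joint_loss`.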
Pages: 3775 - 3782
Number of pages: 8
Related papers
50 in total
  • [31] Multi-view and multi-augmentation for self-supervised visual representation learning
    Van Nhiem Tran
    Chi-En Huang
    Shen-Hsuan Liu
    Muhammad Saqlain Aslam
    Kai-Lin Yang
    Yung-Hui Li
    Jia-Ching Wang
    Applied Intelligence, 2024, 54 : 629 - 656
  • [32] GoMIC: Multi-view image clustering via self-supervised contrastive heterogeneous graph co-learning
    Fang, Uno
    Li, Jianxin
    Akhtar, Naveed
    Li, Man
    Jia, Yan
    WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2023, 26 (04): : 1667 - 1683
  • [33] GoMIC: Multi-view image clustering via self-supervised contrastive heterogeneous graph co-learning
    Uno Fang
    Jianxin Li
    Naveed Akhtar
    Man Li
    Yan Jia
    World Wide Web, 2023, 26 : 1667 - 1683
  • [34] Self-Supervised Discriminative Feature Learning for Deep Multi-View Clustering
    Xu, Jie
    Ren, Yazhou
    Tang, Huayi
    Yang, Zhimeng
    Pan, Lili
    Yang, Yang
    Pu, Xiaorong
    Yu, Philip S.
    He, Lifang
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2023, 35 (07) : 7470 - 7482
  • [35] Multi-scale motion contrastive learning for self-supervised skeleton-based action recognition
    Wu, Yushan
    Xu, Zengmin
    Yuan, Mengwei
    Tang, Tianchi
    Meng, Ruxing
    Wang, Zhongyuan
    MULTIMEDIA SYSTEMS, 2024, 30 (05)
  • [36] Contrastive Self-Supervised Learning for Optical Music Recognition
    Penarrubia, Carlos
    Valero-Mas, Jose J.
    Calvo-Zaragoza, Jorge
    DOCUMENT ANALYSIS SYSTEMS, DAS 2024, 2024, 14994 : 312 - 326
  • [37] Multi-view representation learning for multi-view action recognition
    Hao, Tong
    Wu, Dan
    Wang, Qian
    Sun, Jin-Sheng
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2017, 48 : 453 - 460
  • [38] MULTI-TASK SELF-SUPERVISED LEARNING FOR ROBUST SPEECH RECOGNITION
    Ravanelli, Mirco
    Zhong, Jianyuan
    Pascual, Santiago
    Swietojanski, Pawel
    Monteiro, Joao
    Trmal, Jan
    Bengio, Yoshua
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6989 - 6993
  • [39] Multi-view Self-supervised Heterogeneous Graph Embedding
    Zhao, Jianan
    Wen, Qianlong
    Sun, Shiyu
    Ye, Yanfang
    Zhang, Chuxu
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2021: RESEARCH TRACK, PT II, 2021, 12976 : 319 - 334
  • [40] Self-Supervised Deep Multi-View Subspace Clustering
    Sun, Xiukun
    Cheng, Miaomiao
    Min, Chen
    Jing, Liping
    ASIAN CONFERENCE ON MACHINE LEARNING, VOL 101, 2019, 101 : 1001 - 1016