Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

Cited by: 123
Authors
Wang, Jiangliu [1 ,3 ]
Jiao, Jianbo [2 ,3 ]
Bao, Linchao [3 ]
He, Shengfeng [4 ]
Liu, Yunhui [1 ]
Liu, Wei [3 ]
Affiliations
[1] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[2] Univ Oxford, Oxford, England
[3] Tencent AI Lab, Bellevue, WA 98004 USA
[4] South China Univ Technol, Guangzhou, Peoples R China
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK
DOI
10.1109/CVPR.2019.00413
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We address the problem of video representation learning without human-annotated labels. While previous efforts address the problem by designing novel self-supervised tasks using video data, the learned features are merely on a frame-by-frame basis, which are not applicable to many video analytic tasks where spatio-temporal features are prevailing. In this paper we propose a novel self-supervised approach to learn spatio-temporal features for video representation. Inspired by the success of two-stream approaches in video classification, we propose to learn visual features by regressing both motion and appearance statistics along spatial and temporal dimensions, given only the input video data. Specifically, we extract statistical concepts (fast-motion region and the corresponding dominant direction, spatio-temporal color diversity, dominant color, etc.) from simple patterns in both spatial and temporal domains. Unlike prior puzzles that are even hard for humans to solve, the proposed approach is consistent with human inherent visual habits and therefore easy to answer. We conduct extensive experiments with C3D to validate the effectiveness of our proposed approach. The experiments show that our approach can significantly improve the performance of C3D when applied to video classification tasks. Code is available at https://github.com/laura-wang/video_repres_mas.
Pages: 4001-4010
Number of pages: 10
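
The abstract names "fast-motion region and the corresponding dominant direction" as one of the statistics the network regresses, but this record does not give the exact definitions. The following is only an illustrative sketch under stated assumptions, not the authors' implementation: it assumes a dense optical-flow field has already been computed, takes the fast-motion region to be the grid block with the largest summed flow magnitude, and takes the dominant direction to be the peak of a magnitude-weighted orientation histogram in that block. The names `motion_statistics`, `grid`, and `bins` are hypothetical.

```python
import math


def motion_statistics(flow, grid=2, bins=8):
    """Return (block_index, direction_bin) for one flow field.

    flow: H x W nested list of (dx, dy) displacement tuples (assumed input).
    grid: the frame is split into grid x grid blocks (assumption).
    bins: number of orientation bins; bin 0 is centered on +x (assumption).
    """
    height, width = len(flow), len(flow[0])
    block_h, block_w = height // grid, width // grid
    best_block, best_energy, best_hist = None, -1.0, None

    for by in range(grid):
        for bx in range(grid):
            energy = 0.0
            hist = [0.0] * bins
            for y in range(by * block_h, (by + 1) * block_h):
                for x in range(bx * block_w, (bx + 1) * block_w):
                    dx, dy = flow[y][x]
                    mag = math.hypot(dx, dy)
                    energy += mag
                    # Map the flow angle to [0, 2*pi) and vote with the magnitude.
                    angle = math.atan2(dy, dx) % (2 * math.pi)
                    hist[round(angle / (2 * math.pi) * bins) % bins] += mag
            # Keep the block with the largest total motion (first one on ties).
            if energy > best_energy:
                best_block, best_energy, best_hist = by * grid + bx, energy, hist

    return best_block, best_hist.index(max(best_hist))


# Toy 4 x 4 flow field: slight motion top-left, strong +y motion bottom-right.
flow = [[(0.0, 0.0)] * 4 for _ in range(4)]
flow[0][0] = (0.1, 0.0)
for y in (2, 3):
    for x in (2, 3):
        flow[y][x] = (0.0, 2.0)

print(motion_statistics(flow))  # (3, 2)
```

In a pretext task of this kind, the pair `(block_index, direction_bin)` would serve as a label derived from the video itself, so no human annotation is needed. The appearance statistics mentioned in the abstract (dominant color, color diversity) could be sketched analogously with color histograms per block.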