Self-Supervised Scene-Debiasing for Video Representation Learning via Background Patching

Cited by: 12
Authors
Assefa, Maregu [1]
Jiang, Wei [1]
Gedamu, Kumie [2]
Yilma, Getinet [1]
Kumeda, Bulbula [1]
Ayalew, Melese [1]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Informat & Software Engn, Chengdu 610054, Peoples R China
[2] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 610054, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Action recognition; background patching; label smoothing; scene-debiasing; self-supervised learning; video representation; NETWORKS
DOI
10.1109/TMM.2022.3193559
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Self-supervised learning has considerably improved video representation learning by discovering supervisory signals automatically from unlabeled videos. However, because existing video datasets are scene-biased, current methods lean on the dominant scene context during action inference. This paper therefore proposes Background Patching (BP), a scene-debiasing augmentation strategy that reduces the model's reliance on the video background in a self-supervised contrastive manner. BP reduces the negative influence of the video background by mixing a randomly patched frame into the video background. For each video, BP randomly crops four frames from four different videos and patches them together to construct a new frame. The patched frame is mixed with all frames of the target video to produce a spatially distorted video sample. Existing self-supervised contrastive frameworks are then used to pull the representations of the distorted and original videos closer together. Moreover, BP mixes the semantic labels of the patches with the target video's label, which regularizes the contrastive model and softens the decision boundaries in the embedding space. The model is thus explicitly constrained to suppress the background influence and to place more emphasis on motion changes. Extensive experiments show that BP significantly improves performance on various downstream video understanding tasks, including action recognition, action detection, and video retrieval.
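The patching, mixing, and label-mixing steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not specify the patch layout, the mixing rule, or the label weights, so the 2x2 grid, the linear blend with coefficient `lam`, and the equal split of `lam` across the four donor labels are all assumptions made here for illustration. The function names (`random_crop`, `build_patched_frame`, `background_patch`, `mix_labels`) are likewise hypothetical.

```python
import numpy as np

def random_crop(frame, height, width, rng):
    """Cut a random (height, width) window out of an (H, W, C) frame."""
    top = rng.integers(0, frame.shape[0] - height + 1)
    left = rng.integers(0, frame.shape[1] - width + 1)
    return frame[top:top + height, left:left + width]

def build_patched_frame(donor_frames, height, width, rng):
    """Tile four random crops, one per donor video, into one frame.

    Assumption: a 2x2 grid of equal quadrants; height and width must be even.
    """
    h2, w2 = height // 2, width // 2
    crops = [random_crop(f, h2, w2, rng) for f in donor_frames]
    top_row = np.concatenate(crops[:2], axis=1)
    bottom_row = np.concatenate(crops[2:], axis=1)
    return np.concatenate([top_row, bottom_row], axis=0)

def background_patch(video, donor_frames, lam, rng):
    """Blend one patched frame into every frame of a (T, H, W, C) video.

    Assumption: a linear blend weighted by lam (lam = 0 returns the input).
    """
    _, height, width, _ = video.shape
    patched = build_patched_frame(donor_frames, height, width, rng)
    return (1.0 - lam) * video + lam * patched[None]

def mix_labels(target_label, donor_labels, lam, num_classes):
    """Soft label: (1 - lam) on the target, lam shared equally by the donors.

    Assumption: these weights mirror the blend coefficient used for pixels.
    """
    soft = np.zeros(num_classes)
    soft[target_label] += 1.0 - lam
    for label in donor_labels:
        soft[label] += lam / len(donor_labels)
    return soft

# Usage with random data standing in for real clips.
rng = np.random.default_rng(0)
video = rng.random((8, 112, 112, 3))                     # target clip
donors = [rng.random((128, 171, 3)) for _ in range(4)]   # one frame per donor video
distorted = background_patch(video, donors, lam=0.3, rng=rng)
soft_label = mix_labels(0, [1, 2, 3, 4], lam=0.3, num_classes=5)
```

In a contrastive setup, `video` and `distorted` would form the positive pair whose representations are pulled together, while `soft_label` would replace the hard target wherever the framework uses one.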
Pages: 5500-5515
Page count: 16