DANet: Semi-supervised differentiated auxiliaries guided network for video action recognition

Cited by: 13
Authors
Gao, Guangyu [1 ]
Liu, Ziming [2 ,3 ]
Zhang, Guangjun [1 ]
Li, Jinyang [1 ]
Qin, A. K. [4 ]
Affiliations
[1] Beijing Inst Technol, Sch Comp Sci & Technol, Beijing 100081, Peoples R China
[2] Univ Cote Azur, ACENTAURI Team, INRIA, F-06902 Sophia Antipolis, France
[3] Univ Cote Azur, 3IA Inst, F-06902 Sophia Antipolis, France
[4] Swinburne Univ Technol, Dept Comp Technol, Hawthorn, Vic 3122, Australia
Funding
Australian Research Council; National Natural Science Foundation of China;
Keywords
Action recognition; Semi-supervised learning; Contrastive loss; Unannotated video;
DOI
10.1016/j.neunet.2022.11.009
CLC classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Video Action Recognition (ViAR) aims to identify the category of the human action observed in a given video. With the advent of Deep Learning (DL) techniques, notable performance breakthroughs have been achieved in this area. However, the success of most existing DL-based ViAR methods relies heavily on the availability of a large amount of annotated data, i.e., videos labeled with their action categories. In practice, obtaining the desired number of annotations is often difficult due to the high cost of labeling, which can lead to significant performance degradation for these methods. To address this issue, we propose an end-to-end semi-supervised Differentiated Auxiliary guided Network (DANet) that makes the best use of a few annotated videos. In addition to common supervised learning on the few annotated videos, DANet also exploits the knowledge of multiple pre-trained auxiliary networks to optimize the ViAR network in a self-supervised way on the same data with annotations withheld. Given the tight connection between video action recognition and classical static image-based visual tasks, the abundant knowledge in pre-trained static image-based models can be used to train the ViAR model. Specifically, DANet is a two-branch architecture comprising a target branch (the ViAR network) and an auxiliary branch (multiple auxiliary networks, i.e., diverse off-the-shelf models from relevant image tasks). Given a limited number of annotated videos, we train the target ViAR network end-to-end in a semi-supervised way: with the supervised cross-entropy loss on annotated videos, and with per-auxiliary weighted self-supervised contrastive losses on the same videos without using their annotations. We further explore different weighted guidance from the auxiliary networks to the ViAR network, to better reflect the varying relationships between the image-based models and the ViAR model. Finally, we conduct extensive experiments on several popular action recognition benchmarks against existing state-of-the-art methods; the results demonstrate the superiority of DANet over most of the compared methods. In particular, DANet clearly surpasses state-of-the-art ViAR methods even with far fewer annotated videos. (c) 2022 Elsevier Ltd. All rights reserved.
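The abstract's training objective can be made concrete with a short sketch. Below is a minimal PyTorch illustration, not the authors' code, of a combined semi-supervised loss as described: supervised cross-entropy on annotated clips, plus per-auxiliary weighted InfoNCE-style contrastive terms that align the ViAR network's projected features with those of frozen, pre-trained image-task models on clips whose annotations are withheld. Every name, tensor shape, and the particular InfoNCE form here is an assumption for illustration; the paper's exact losses and weighting scheme may differ.

import torch
import torch.nn.functional as F

def info_nce(query, key, temperature=0.07):
    # Contrast each query against all keys in the batch; the matching
    # (same-index) pair is the positive, all other pairs are negatives.
    query = F.normalize(query, dim=1)
    key = F.normalize(key, dim=1)
    logits = query @ key.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, targets)                # positives on the diagonal

def danet_style_loss(target_logits, labels,
                     target_embeds, aux_embeds, aux_weights):
    # target_logits: (B_l, num_classes) ViAR predictions on annotated clips
    # labels:        (B_l,) ground-truth action categories
    # target_embeds: list of (B_u, D) projections of ViAR features, one per auxiliary
    # aux_embeds:    list of (B_u, D) features from frozen pre-trained image models,
    #                computed on the same clips with annotations withheld
    # aux_weights:   hypothetical per-auxiliary guidance weights
    supervised = F.cross_entropy(target_logits, labels)
    contrastive = sum(w * info_nce(t, a.detach())          # auxiliaries are not updated
                      for w, t, a in zip(aux_weights, target_embeds, aux_embeds))
    return supervised + contrastive

Under this reading, the per-auxiliary weights are the "differentiated" part of the guidance: image-task models more closely related to action recognition would contribute more to the contrastive term, while the cross-entropy term alone drives learning on the labeled subset.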
Pages: 121-131 (11 pages)