Multi-label zero-shot human action recognition via joint latent ranking embedding

Cited by: 31
Authors
Wang, Qian [1 ]
Chen, Ke [1 ]
Affiliations
[1] Univ Manchester, Dept Comp Sci, Manchester, Lancs, England
Keywords
Human action recognition; Multi-label learning; Zero-shot learning; Joint latent ranking embedding; Weakly supervised learning
DOI
10.1016/j.neunet.2019.09.029
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Human action recognition is one of the most challenging tasks in computer vision. Most existing work on human action recognition is limited to single-label classification. A real-world video stream, however, often contains multiple human actions. Such a video stream is usually annotated collectively with a set of relevant human action labels, which leads to a multi-label learning problem. Furthermore, there is a great number of meaningful human actions in reality, but it would be extremely difficult, if not impossible, to collect and annotate sufficient video clips for all of them to train a supervised learning model. In this paper, we formulate real-world human action recognition as a multi-label zero-shot learning problem. To address this problem, a joint latent ranking embedding framework is proposed. Our framework holistically tackles the issue of unknown temporal boundaries between different actions within a video clip for multi-label learning, and exploits side information on the semantic relationships between human actions for zero-shot learning. Specifically, the framework consists of two component neural networks for visual and semantic embedding, respectively. Multi-label zero-shot recognition is then performed by measuring the relatedness scores of candidate action labels to a test video clip in the joint latent visual and semantic embedding spaces. We evaluate our framework in different settings, including a novel data-split scheme designed especially for evaluating multi-label zero-shot learning. Experimental results on two weakly annotated multi-label human action datasets (i.e., Breakfast and Charades) demonstrate the effectiveness of our framework. Crown Copyright © 2019 Published by Elsevier Ltd. All rights reserved.
Pages: 1-23 (23 pages)
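
To make the abstract's two-branch scoring idea concrete, here is a minimal PyTorch sketch. It is not the paper's implementation: the layer sizes, cosine-similarity scoring, max-over-segments pooling, and the hinge ranking loss (with a hypothetical margin of 0.2) are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch only -- not the authors' implementation. Layer
# sizes, cosine scoring, max-over-segments pooling and the hinge
# ranking loss below are assumptions made for demonstration.

class VisualEmbedding(nn.Module):
    """Maps per-segment visual features into the joint latent space."""
    def __init__(self, visual_dim=2048, latent_dim=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(visual_dim, 512),
            nn.ReLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, x):        # x: (num_segments, visual_dim)
        return self.net(x)       # -> (num_segments, latent_dim)


class SemanticEmbedding(nn.Module):
    """Maps label word vectors (e.g. word2vec) into the same space."""
    def __init__(self, semantic_dim=300, latent_dim=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(semantic_dim, 512),
            nn.ReLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, s):        # s: (num_labels, semantic_dim)
        return self.net(s)


def relatedness_scores(video_feats, label_vecs, f_v, f_s):
    """Score every candidate label against one video clip.

    Temporal boundaries between actions are unknown, so the clip is
    treated as a bag of segments and each label takes its maximum
    cosine similarity over segments (a weakly supervised heuristic).
    """
    v = F.normalize(f_v(video_feats), dim=-1)   # (T, d)
    s = F.normalize(f_s(label_vecs), dim=-1)    # (L, d)
    return (s @ v.t()).max(dim=1).values        # (L,)


def ranking_loss(scores, relevant, margin=0.2):
    """Hinge ranking loss: each relevant label should outscore each
    irrelevant label by at least `margin` (hypothetical objective)."""
    pos, neg = scores[relevant], scores[~relevant]
    return (margin - pos.unsqueeze(1) + neg.unsqueeze(0)).clamp(min=0).mean()


# Usage: 8 segments of 2048-d visual features, 10 candidate labels
# (word vectors stand in for seen and unseen action names alike).
f_v, f_s = VisualEmbedding(), SemanticEmbedding()
scores = relatedness_scores(torch.randn(8, 2048), torch.randn(10, 300), f_v, f_s)
relevant = torch.zeros(10, dtype=torch.bool)
relevant[:3] = True              # pretend the first 3 labels are annotated
print(scores, ranking_loss(scores, relevant))
```

The zero-shot capability in this sketch comes entirely from the semantic branch: an unseen action is scored through its word vector alone, so no training video of that action is required, which mirrors the side-information mechanism the abstract describes.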