GAN for vision, KG for relation: A two-stage network for zero-shot action recognition

Cited: 16
Authors
Sun, Bin [1 ]
Kong, Dehui [1 ]
Wang, Shaofan [1 ]
Li, Jinghua [1 ]
Yin, Baocai [1 ]
Luo, Xiaonan [1 ]
Affiliations
[1] Beijing Univ Technol, Fac Informat Technol, Beijing Inst Artificial Intelligence, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
Funding
Beijing Natural Science Foundation; National Science Foundation (USA); National Key Research and Development Program of China
Keywords
Action recognition; Zero-shot learning; Generative adversarial networks; Graph convolution network
DOI
10.1016/j.patcog.2022.108563
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Zero-shot action recognition can recognize samples of unseen classes, unavailable during training, by exploiting a common latent semantic representation shared across samples. However, most methods neglect the connotative and extensional relations between action classes, which leads to poor generalization in zero-shot learning. Furthermore, the learned classifier tends to predict samples as belonging to seen classes, which degrades classification performance. To solve these problems, we propose a two-stage deep neural network for zero-shot action recognition, which consists of a feature-generation sub-network serving as the sampling stage and a graph attention sub-network serving as the classification stage. In the sampling stage, we use a generative adversarial network (GAN), trained on action features and word vectors of seen classes, to synthesize action features of unseen classes, which balances the training data between seen and unseen classes. In the classification stage, we construct a knowledge graph (KG) based on the relationships between word vectors of action classes and related objects, and propose an attention-based graph convolutional network (GCN) that dynamically updates the relations between action classes and objects and enhances the generalization ability of zero-shot learning. In both stages, word vectors serve as the bridge for feature generation and classifier generalization from seen to unseen classes. We compare our method with state-of-the-art methods on the UCF101 and HMDB51 datasets. Experimental results show that our proposed method improves the classification performance of the trained classifier and achieves higher accuracy. © 2022 Elsevier Ltd. All rights reserved.
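The two stages described in the abstract map onto two small modules: a conditional generator that turns class word vectors plus noise into synthetic action features, and an attention-weighted graph convolution that re-weights knowledge-graph edges dynamically. Below is a minimal PyTorch sketch written only from the abstract's description; the names Generator and AttentionGCNLayer, the dimensions (2048-d video features, 300-d word vectors), and all hyperparameters are illustrative assumptions, not the authors' released implementation. The GAN discriminator and training losses are omitted for brevity.

# Minimal sketch of the two-stage pipeline (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM = 2048   # assumed dimension of extracted video action features
WORD_DIM = 300    # assumed word-vector (e.g., word2vec) dimension
NOISE_DIM = 100   # assumed GAN noise dimension

class Generator(nn.Module):
    """Sampling stage: synthesize an action feature conditioned on a class word vector."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(WORD_DIM + NOISE_DIM, 1024),
            nn.LeakyReLU(0.2),
            nn.Linear(1024, FEAT_DIM),
            nn.ReLU(),
        )

    def forward(self, word_vec, noise):
        return self.net(torch.cat([word_vec, noise], dim=-1))

class AttentionGCNLayer(nn.Module):
    """Classification stage: one attention-weighted graph convolution over the
    action/object knowledge graph; attention replaces fixed adjacency weights."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x, adj):
        h = self.proj(x)                                   # (N, out_dim)
        n = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.attn(pairs).squeeze(-1), 0.2)
        e = e.masked_fill(adj == 0, float("-inf"))         # keep only KG edges
        alpha = torch.softmax(e, dim=-1)                   # dynamic edge weights
        return F.elu(alpha @ h)

# Usage sketch: word vectors bridge both stages. Synthetic unseen-class
# features come from the generator; per-class classifier weights from the GCN.
G, gcn = Generator(), AttentionGCNLayer(WORD_DIM, FEAT_DIM)
words = torch.randn(5, WORD_DIM)     # 5 action/object nodes (toy data)
adj = torch.eye(5)                   # self-loops; real edges come from the KG
fake_feats = G(words, torch.randn(5, NOISE_DIM))
logits = fake_feats @ gcn(words, adj).t()   # classify synthesized features

Per the abstract, the synthesized unseen-class features let the classifier train on a balanced mix of seen and unseen classes, which counters its bias toward predicting seen classes.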
Pages: 10