Unsupervised Feature Learning of Human Actions as Trajectories in Pose Embedding Manifold

Cited by: 47
Authors
Kundu, Jogendra Nath [1 ]
Gor, Maharshi [1 ]
Uppala, Phani Krishna [1 ]
Babu, R. Venkatesh [1 ]
Affiliations
[1] Indian Inst Sci, CDS, Video Analyt Lab, Bangalore, Karnataka, India
Source
2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV) | 2019
DOI
10.1109/WACV.2019.00160
Chinese Library Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Classification Code
0808 ; 0809 ;
Abstract
An unsupervised human action modeling framework can provide a useful pose-sequence representation that can be utilized in a variety of pose analysis applications. In this work we propose a novel temporal pose-sequence modeling framework, which can efficiently embed the dynamics of 3D human-skeleton joints into a continuous latent space. In contrast to the end-to-end frameworks explored in previous works, we disentangle the task of learning individual pose representations from the task of learning actions as trajectories in the pose embedding space. In order to realize a continuous pose embedding manifold with improved reconstructions, we propose an unsupervised manifold learning procedure named Encoder GAN (EnGAN). Further, we use the pose embeddings generated by EnGAN to model human actions with a bidirectional RNN auto-encoder architecture, PoseRNN. We introduce a first-order gradient loss to explicitly enforce temporal regularity in the predicted motion sequence. A hierarchical feature fusion technique is also investigated for simultaneous modeling of local skeleton joints along with global pose variations. We demonstrate state-of-the-art transferability of the learned representation against other supervised and unsupervised motion embeddings for the task of fine-grained action recognition on the SBU Interaction dataset. Further, we show the qualitative strengths of the proposed framework by visualizing skeleton pose reconstructions and interpolations in the pose-embedding space, as well as low-dimensional principal component projections of the reconstructed pose trajectories.
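To illustrate the sequence-modeling stage described in the abstract, below is a minimal sketch, assuming PyTorch, of a bidirectional RNN auto-encoder over per-frame pose embeddings together with a first-order gradient (frame-difference) loss that encourages temporal regularity. The GRU cell choice, layer sizes, and loss weighting are illustrative assumptions and are not taken from the paper.

```python
# Sketch of a PoseRNN-style bidirectional RNN auto-encoder over pose
# embeddings (e.g. produced by an EnGAN-like pose encoder), plus a
# first-order gradient loss on frame-to-frame differences.
# All hyperparameters here are assumptions, not values from the paper.
import torch
import torch.nn as nn

class PoseSeqAutoencoder(nn.Module):
    def __init__(self, embed_dim=32, hidden_dim=128):
        super().__init__()
        # Bidirectional encoder summarizes the pose-embedding sequence.
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # Decoder RNN reconstructs the embedding sequence from the summary.
        self.decoder = nn.GRU(2 * hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, embed_dim)

    def forward(self, z_seq):
        # z_seq: (batch, time, embed_dim) per-frame pose embeddings.
        _, h = self.encoder(z_seq)                 # h: (2, batch, hidden)
        summary = torch.cat([h[0], h[1]], dim=-1)  # (batch, 2*hidden)
        t = z_seq.size(1)
        dec_in = summary.unsqueeze(1).repeat(1, t, 1)  # repeat per time step
        dec_out, _ = self.decoder(dec_in)
        return self.out(dec_out)                   # reconstructed sequence

def first_order_gradient_loss(recon, target):
    # Match frame-to-frame velocities of the reconstruction to the target,
    # explicitly penalizing temporally irregular predictions.
    d_recon = recon[:, 1:] - recon[:, :-1]
    d_target = target[:, 1:] - target[:, :-1]
    return torch.mean(torch.abs(d_recon - d_target))

# Example objective: reconstruction + weighted first-order gradient term.
model = PoseSeqAutoencoder()
z = torch.randn(4, 40, 32)      # 4 sequences, 40 frames, 32-d pose embeddings
recon = model(z)
loss = nn.functional.mse_loss(recon, z) + 0.1 * first_order_gradient_loss(recon, z)
loss.backward()
```

The sketch treats the pose encoder as given and trains only on the embedding sequences, mirroring the paper's separation of per-pose representation learning from trajectory-level action modeling.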
Pages: 1459-1467
Page count: 9