Deep Dynamic Neural Networks for Multimodal Gesture Segmentation and Recognition

Cited by: 341

Authors:

Wu, Di [1]

Pigou, Lionel [2]

Kindermans, Pieter-Jan [3]

Nam Do-Hoang Le [4]

Shao, Ling [5]

Dambre, Joni [2]

Odobez, Jean-Marc [4]

Affiliations:

[1] IDIAP, Percept & Act Understanding, Martigny, Valais, Switzerland

[2] Univ Ghent, ELIS, Ghent, Oost Vlaanderen, Belgium

[3] TU Berlin, Machine Learning Grp, Berlin, Germany

[4] IDIAP Res Inst, Comp Vis, Martigny, Valais, Switzerland

[5] Northumbria Univ, Dept Comp Sci & Digital Technol, Newcastle Upon Tyne, Tyne & Wear, England

Source:

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE | 2016, Vol. 38, No. 8

Funding:

EU Horizon 2020; National Natural Science Foundation of China;

Keywords:

Deep learning; convolutional neural networks; deep belief networks; hidden Markov models; gesture recognition;

DOI:

10.1109/TPAMI.2016.2537340

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper describes a novel method called Deep Dynamic Neural Networks (DDNN) for multimodal gesture recognition. A semi-supervised hierarchical dynamic framework based on a Hidden Markov Model (HMM) is proposed for simultaneous gesture segmentation and recognition, where skeleton joint information, depth images, and RGB images are the multimodal input observations. Unlike most traditional approaches, which rely on the construction of complex handcrafted features, our approach learns high-level spatiotemporal representations using deep neural networks suited to the input modality: a Gaussian-Bernoulli Deep Belief Network (DBN) to handle skeletal dynamics, and a 3D Convolutional Neural Network (3DCNN) to manage and fuse batches of depth and RGB images. This is achieved through the modeling and learning of the emission probabilities of the HMM required to infer the gesture sequence. This purely data-driven approach achieves a Jaccard index score of 0.81 in the ChaLearn LAP gesture spotting challenge. The performance is on par with a variety of state-of-the-art hand-tuned feature-based approaches and other learning-based methods, opening the door to the use of deep learning techniques to further explore multimodal time series data.
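The abstract states that neural networks supply the HMM emission probabilities used to infer the gesture sequence, but it does not give the formulation. The following is a minimal sketch of one common way such a hybrid can work, not the authors' implementation: per-frame network posteriors are divided by state priors to obtain scaled likelihoods, and Viterbi decoding then yields a temporally consistent state path. The two-state setup (0 = rest, 1 = gesture), the function name `viterbi`, and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def viterbi(log_emission, log_transition, log_prior):
    """Most likely state path under an HMM (all inputs in the log domain).

    log_emission:   (T, S) per-frame emission scores for each state
    log_transition: (S, S), entry [i, j] is the score of moving from i to j
    log_prior:      (S,) initial state scores
    """
    T, S = log_emission.shape
    score = log_prior + log_emission[0]
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_transition      # rows: from, cols: to
        backptr[t] = np.argmax(cand, axis=0)        # best predecessor per state
        score = cand[backptr[t], np.arange(S)] + log_emission[t]
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):                   # trace back from the end
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Hypothetical per-frame posteriors P(state | frame), standing in for the
# output of a network. Frame 2 is a noisy frame inside a gesture.
posteriors = np.array([
    [0.95, 0.05],
    [0.05, 0.95],
    [0.60, 0.40],
    [0.05, 0.95],
    [0.95, 0.05],
])
state_prior = np.array([0.5, 0.5])       # assumed uniform state priors
transition = np.array([[0.7, 0.3],       # assumed "sticky" transitions
                       [0.3, 0.7]])

# Scaled likelihood: p(frame | state) is proportional to posterior / prior.
log_emission = np.log(posteriors) - np.log(state_prior)

decoded = viterbi(log_emission, np.log(transition), np.log(state_prior))
framewise = posteriors.argmax(axis=1).tolist()
print("frame-wise argmax:", framewise)
print("Viterbi path:     ", decoded)
```

In this toy example the frame-wise argmax splits the gesture in two at the noisy frame, while the decoded path keeps it as one contiguous segment, which is the sense in which decoding performs segmentation and recognition together.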

Citation

Pages: 1583-1597

Page count: 15

