TMMF: Temporal Multi-Modal Fusion for Single-Stage Continuous Gesture Recognition

Cited by: 31
Authors
Gammulle, Harshala [1 ]
Denman, Simon [1 ]
Sridharan, Sridha [1 ]
Fookes, Clinton [1 ]
Affiliations
[1] Queensland Univ Technol, Signal Proc Artificial Intelligence & Vis Technol, Brisbane, Qld 4000, Australia
Funding
Australian Research Council
Keywords
Gesture recognition; Feature extraction; Solid modeling; Streaming media; Semantics; Visualization; Three-dimensional displays; spatio-temporal representation learning; temporal convolution networks
DOI
10.1109/TIP.2021.3108349
CLC Number
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Gesture recognition is a much-studied research area with myriad real-world applications, including robotics and human-machine interaction. Current gesture recognition methods have focused on recognising isolated gestures, while existing continuous gesture recognition methods are limited to two-stage approaches in which independent models are required for detection and classification, with the performance of the latter constrained by detection performance. In contrast, we introduce a single-stage continuous gesture recognition framework, called Temporal Multi-Modal Fusion (TMMF), that can detect and classify multiple gestures in a video with a single model. This approach learns the natural transitions between gestures and non-gestures without requiring a pre-processing segmentation step to detect individual gestures. To achieve this, we introduce a multi-modal fusion mechanism that supports the integration of important information flowing from the multi-modal inputs and is scalable to any number of modes. Additionally, we propose Unimodal Feature Mapping (UFM) and Multi-modal Feature Mapping (MFM) models to map the uni-modal features and the fused multi-modal features, respectively. To further enhance performance, we propose a mid-point-based loss function that encourages smooth alignment between the ground truth and the prediction, helping the model to learn natural gesture transitions. We demonstrate the utility of the proposed framework, which handles variable-length input videos and outperforms the state of the art on three challenging datasets: EgoGesture, IPN Hand, and the ChaLearn LAP Continuous Gesture Dataset (ConGD). Furthermore, ablation experiments show the importance of the different components of the proposed framework.
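To make the single-stage formulation concrete, the following is a minimal sketch, written in PyTorch (the record does not say what the authors used), of the pipeline the abstract describes: one Unimodal Feature Mapping (UFM) block per modality, channel-wise concatenation as a stand-in for the paper's fusion mechanism, a Multi-modal Feature Mapping (MFM) head producing frame-wise logits over the gesture classes plus a non-gesture class, and a mid-point-weighted frame-wise cross-entropy standing in for the mid-point-based loss. All class and function names, the concatenation fusion, and the exact weighting scheme are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only -- module names, fusion-by-concatenation, and the
# mid-point weighting are assumptions based on the abstract, not the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnimodalFeatureMapping(nn.Module):
    """Maps one modality's frame-level features (B, C_in, T) -> (B, C_out, T)
    with a small temporal-convolution stack."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(c_out, c_out, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class TMMFSketch(nn.Module):
    """Fuses any number of modality streams and emits per-frame class logits;
    class 0 is reserved for non-gesture (background) frames."""

    def __init__(self, mode_channels, hidden: int = 128, num_classes: int = 84):
        super().__init__()
        # One UFM per modality -- adding a modality just adds one more block.
        self.ufms = nn.ModuleList(
            UnimodalFeatureMapping(c, hidden) for c in mode_channels
        )
        # MFM: maps the fused representation to frame-wise logits.
        self.mfm = nn.Sequential(
            nn.Conv1d(hidden * len(mode_channels), hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, num_classes, kernel_size=1),
        )

    def forward(self, streams):
        # streams: list of (B, C_i, T) tensors, one per modality.
        fused = torch.cat([ufm(x) for ufm, x in zip(self.ufms, streams)], dim=1)
        return self.mfm(fused)  # (B, num_classes, T)


def midpoint_weighted_loss(logits, labels, segments, midpoint_weight=2.0):
    """Frame-wise cross-entropy with extra weight on each gesture's temporal
    mid-point, pushing the model to be confident away from the ambiguous
    transition boundaries. `segments` is a per-batch-item list of
    (start, end, class_id) ground-truth gesture spans."""
    batch_size, _, num_frames = logits.shape
    weights = torch.ones(batch_size, num_frames, device=logits.device)
    for b, spans in enumerate(segments):
        for start, end, _ in spans:
            weights[b, (start + end) // 2] = midpoint_weight
    per_frame = F.cross_entropy(logits, labels, reduction="none")  # (B, T)
    return (weights * per_frame).mean()


# Usage on dummy data: two modalities (e.g. RGB and depth feature streams),
# a 100-frame clip, and one annotated gesture span per clip.
model = TMMFSketch(mode_channels=[512, 512])
rgb, depth = torch.randn(2, 512, 100), torch.randn(2, 512, 100)
logits = model([rgb, depth])                    # (2, 84, 100)
labels = torch.zeros(2, 100, dtype=torch.long)  # background everywhere...
labels[:, 20:60] = 5                            # ...except one gesture span
loss = midpoint_weighted_loss(
    logits, labels, segments=[[(20, 60, 5)], [(20, 60, 5)]]
)
```

Because non-gesture frames carry their own background label, detection and classification both fall out of a single frame-wise prediction, which is the essence of the single-stage formulation; a two-stage system would instead need a separate detector to segment gestures before classifying them.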
Pages: 7689-7701
Page count: 13