TMMF: Temporal Multi-Modal Fusion for Single-Stage Continuous Gesture Recognition

Cited by: 31
Authors
Gammulle, Harshala [1 ]
Denman, Simon [1 ]
Sridharan, Sridha [1 ]
Fookes, Clinton [1 ]
Affiliations
[1] Queensland Univ Technol, Signal Proc Artificial Intelligence & Vis Technol, Brisbane, Qld 4000, Australia
Funding
Australian Research Council
Keywords
Gesture recognition; Feature extraction; Solid modeling; Streaming media; Semantics; Visualization; Three-dimensional displays; spatio-temporal representation learning; temporal convolution networks
DOI
10.1109/TIP.2021.3108349
CLC Number
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Gesture recognition is a much-studied research area with myriad real-world applications, including robotics and human-machine interaction. Current gesture recognition methods have focused on recognising isolated gestures, while existing continuous gesture recognition methods are limited to two-stage approaches in which independent models are required for detection and classification, with the performance of the latter constrained by detection performance. In contrast, we introduce a single-stage continuous gesture recognition framework, called Temporal Multi-Modal Fusion (TMMF), that can detect and classify multiple gestures in a video with a single model. This approach learns the natural transitions between gestures and non-gestures without requiring a pre-processing segmentation step to detect individual gestures. To achieve this, we introduce a multi-modal fusion mechanism that supports the integration of important information flowing from the multi-modal inputs and is scalable to any number of modes. Additionally, we propose Unimodal Feature Mapping (UFM) and Multi-modal Feature Mapping (MFM) models to map the uni-modal features and the fused multi-modal features, respectively. To further enhance performance, we propose a mid-point-based loss function that encourages smooth alignment between the ground truth and the prediction, helping the model to learn natural gesture transitions. We demonstrate the utility of the proposed framework, which handles variable-length input videos and outperforms the state of the art on three challenging datasets: EgoGesture, IPN Hand, and the ChaLearn LAP Continuous Gesture Dataset (ConGD). Furthermore, ablation experiments show the importance of the different components of the proposed framework.
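To make the single-stage formulation concrete, the following is a minimal sketch, written in PyTorch (the record does not say what the authors used), of the pipeline the abstract describes: one Unimodal Feature Mapping (UFM) block per modality, channel-wise concatenation as a stand-in for the paper's fusion mechanism, a Multi-modal Feature Mapping (MFM) head producing frame-wise logits over the gesture classes plus a non-gesture class, and a mid-point-weighted frame-wise cross-entropy standing in for the mid-point-based loss. All class and function names, the concatenation fusion, and the exact weighting scheme are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only -- module names, fusion-by-concatenation, and the
# mid-point weighting are assumptions based on the abstract, not the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnimodalFeatureMapping(nn.Module):
    """Maps one modality's frame-level features (B, C_in, T) -> (B, C_out, T)
    with a small temporal-convolution stack."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(c_out, c_out, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class TMMFSketch(nn.Module):
    """Fuses any number of modality streams and emits per-frame class logits;
    class 0 is reserved for non-gesture (background) frames."""

    def __init__(self, mode_channels, hidden: int = 128, num_classes: int = 84):
        super().__init__()
        # One UFM per modality -- adding a modality just adds one more block.
        self.ufms = nn.ModuleList(
            UnimodalFeatureMapping(c, hidden) for c in mode_channels
        )
        # MFM: maps the fused representation to frame-wise logits.
        self.mfm = nn.Sequential(
            nn.Conv1d(hidden * len(mode_channels), hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, num_classes, kernel_size=1),
        )

    def forward(self, streams):
        # streams: list of (B, C_i, T) tensors, one per modality.
        fused = torch.cat([ufm(x) for ufm, x in zip(self.ufms, streams)], dim=1)
        return self.mfm(fused)  # (B, num_classes, T)


def midpoint_weighted_loss(logits, labels, segments, midpoint_weight=2.0):
    """Frame-wise cross-entropy with extra weight on each gesture's temporal
    mid-point, pushing the model to be confident away from the ambiguous
    transition boundaries. `segments` is a per-batch-item list of
    (start, end, class_id) ground-truth gesture spans."""
    batch_size, _, num_frames = logits.shape
    weights = torch.ones(batch_size, num_frames, device=logits.device)
    for b, spans in enumerate(segments):
        for start, end, _ in spans:
            weights[b, (start + end) // 2] = midpoint_weight
    per_frame = F.cross_entropy(logits, labels, reduction="none")  # (B, T)
    return (weights * per_frame).mean()


# Usage on dummy data: two modalities (e.g. RGB and depth feature streams),
# a 100-frame clip, and one annotated gesture span per clip.
model = TMMFSketch(mode_channels=[512, 512])
rgb, depth = torch.randn(2, 512, 100), torch.randn(2, 512, 100)
logits = model([rgb, depth])                    # (2, 84, 100)
labels = torch.zeros(2, 100, dtype=torch.long)  # background everywhere...
labels[:, 20:60] = 5                            # ...except one gesture span
loss = midpoint_weighted_loss(
    logits, labels, segments=[[(20, 60, 5)], [(20, 60, 5)]]
)
```

Because non-gesture frames carry their own background label, detection and classification both fall out of a single frame-wise prediction, which is the essence of the single-stage formulation; a two-stage system would instead need a separate detector to segment gestures before classifying them.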
Pages: 7689-7701
Page count: 13