A Multi-modal Gesture Recognition System Using Audio, Video, and Skeletal Joint Data

Cited by: 15
Authors
Nandakumar, Karthik [1 ]
Wah, Wan Kong [1 ]
Alice, Chan Siu Man [1 ]
Terence, Ng Wen Zheng [1 ]
Gang, Wang Jian [1 ]
Yun, Yau Wei [1 ]
Affiliation
[1] A*STAR, Institute for Infocomm Research (I2R), 1 Fusionopolis Way, Singapore
Source
ICMI'13: PROCEEDINGS OF THE 2013 ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION | 2013
Keywords
Multi-modal gesture recognition; log-energy features; Mel frequency cepstral coefficients (MFCC); Space-Time Interest Points (STIP); covariance descriptor; Hidden Markov Model (HMM); Support Vector Machine (SVM); fusion; NORMALIZATION;
DOI
10.1145/2522848.2532593
CLC Number
TP301 [Theory, Methods]
Subject Classification Code
081202
Abstract
This paper describes the gesture recognition system developed by the Institute for Infocomm Research (I2R) for the 2013 ICMI CHALEARN Multi-modal Gesture Recognition Challenge. The proposed system adopts a multi-modal approach for both detecting and recognizing gestures. Automated gesture detection uses audio signals together with hand-joint information from the Kinect sensor to segment a sample into individual gestures. Once the gestures are detected and segmented, features extracted from three modalities, namely audio, 2-dimensional video (RGB), and skeletal joints (Kinect), are used to classify a given sequence of frames as one of the 20 known gestures or as an unrecognized gesture. Mel frequency cepstral coefficients (MFCC) are extracted from the audio signals and classified with a Hidden Markov Model (HMM). Space-Time Interest Points (STIP) represent the RGB modality, while a covariance descriptor is extracted from the skeletal joint data; for both the RGB and Kinect modalities, Support Vector Machines (SVM) are used for gesture classification. Finally, a fusion scheme accumulates evidence from all three modalities and predicts the sequence of gestures in each test sample. The proposed gesture recognition system achieves an average edit distance of 0.2074 over the 275 test samples containing 2,742 unlabeled gestures. While the system recognizes the known gestures with high accuracy, most of the errors are insertions, which occur when an unrecognized gesture is misclassified as one of the 20 known gestures.
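The reported score of 0.2074 is a Levenshtein edit distance between the predicted and ground-truth gesture-label sequences, normalized per sample. A minimal sketch of such a metric (a hypothetical helper written for illustration, not the authors' evaluation code; the exact challenge normalization may differ):

```python
def edit_distance(pred, truth):
    """Levenshtein distance between a predicted and a ground-truth
    sequence of gesture labels, normalized by the ground-truth length.
    Counts insertions, deletions, and substitutions, so a spurious
    extra gesture (an insertion error) raises the score just as a
    missed or misclassified gesture does."""
    m, n = len(pred), len(truth)
    # d[i][j] = edit distance between pred[:i] and truth[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of pred[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of truth[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == truth[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / max(n, 1)
```

Under this definition, a system whose only mistake on a 10-gesture sample is one inserted gesture scores 0.1 on that sample; averaging such per-sample scores over all 275 test samples yields the figure reported above.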
Pages: 475-482 (8 pages)