A Multi-modal Gesture Recognition System Using Audio, Video, and Skeletal Joint Data

Cited by: 15
Authors
Nandakumar, Karthik [1 ]
Wah, Wan Kong [1 ]
Alice, Chan Siu Man [1 ]
Terence, Ng Wen Zheng [1 ]
Gang, Wang Jian [1 ]
Yun, Yau Wei [1 ]
Affiliation
[1] A*STAR, Institute for Infocomm Research (I2R), 1 Fusionopolis Way, Singapore
Source
ICMI'13: PROCEEDINGS OF THE 2013 ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION | 2013
Keywords
Multi-modal gesture recognition; log-energy features; Mel frequency cepstral coefficients (MFCC); Space-Time Interest Points (STIP); covariance descriptor; Hidden Markov Model (HMM); Support Vector Machine (SVM); fusion; NORMALIZATION;
DOI
10.1145/2522848.2532593
CLC Number
TP301 [Theory, Methods]
Subject Classification Code
081202
Abstract
This paper describes the gesture recognition system developed by the Institute for Infocomm Research (I2R) for the 2013 ICMI CHALEARN Multi-modal Gesture Recognition Challenge. The proposed system adopts a multi-modal approach for both detecting and recognizing gestures. Automated gesture detection uses audio signals together with hand-joint information from the Kinect sensor to segment a sample into individual gestures. Once the gestures are detected and segmented, features extracted from three modalities, namely audio, 2-dimensional video (RGB), and skeletal joints (Kinect), are used to classify a given sequence of frames as one of the 20 known gestures or as an unrecognized gesture. Mel frequency cepstral coefficients (MFCC) are extracted from the audio signals and classified with a Hidden Markov Model (HMM). Space-Time Interest Points (STIP) represent the RGB modality, while a covariance descriptor is extracted from the skeletal joint data; for both the RGB and Kinect modalities, Support Vector Machines (SVM) are used for gesture classification. Finally, a fusion scheme accumulates evidence from all three modalities and predicts the sequence of gestures in each test sample. The proposed gesture recognition system achieves an average edit distance of 0.2074 over the 275 test samples containing 2,742 unlabeled gestures. While the system recognizes the known gestures with high accuracy, most of the errors are insertions, which occur when an unrecognized gesture is misclassified as one of the 20 known gestures.
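The reported score of 0.2074 is a Levenshtein edit distance between the predicted and ground-truth gesture-label sequences, normalized per sample. A minimal sketch of such a metric (a hypothetical helper written for illustration, not the authors' evaluation code; the exact challenge normalization may differ):

```python
def edit_distance(pred, truth):
    """Levenshtein distance between a predicted and a ground-truth
    sequence of gesture labels, normalized by the ground-truth length.
    Counts insertions, deletions, and substitutions, so a spurious
    extra gesture (an insertion error) raises the score just as a
    missed or misclassified gesture does."""
    m, n = len(pred), len(truth)
    # d[i][j] = edit distance between pred[:i] and truth[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of pred[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of truth[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == truth[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / max(n, 1)
```

Under this definition, a system whose only mistake on a 10-gesture sample is one inserted gesture scores 0.1 on that sample; averaging such per-sample scores over all 275 test samples yields the figure reported above.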
Pages: 475-482 (8 pages)