Multimodal Fusion Framework Based on Statistical Attention and Contrastive Attention for Sign Language Recognition

Cited by: 17
Authors
Zhang, Jiangtao [1 ]
Wang, Qingshan [1 ]
Wang, Qi [1 ]
Zheng, Zhiwen [1 ]
Affiliations
[1] Hefei Univ Technol, Sch Math, Hefei 230601, Anhui, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Gesture recognition; assistive technologies; feature extraction; skeleton; hidden Markov models; motion detection; robot sensing systems; sign language recognition; wearable computing; multimodal fusion; sEMG; deep learning; Laplacian operator; field; shape
DOI
10.1109/TMC.2023.3235935
Chinese Library Classification
TP [automation technology, computer technology]
Discipline Code
0812
Abstract
Sign language recognition (SLR) enables hearing-impaired people to communicate better with able-bodied individuals. The diversity of multiple modalities can be exploited to improve SLR, but existing multimodal fusion methods do not model multimodal interrelationships in depth. This paper proposes SeeSign, a multimodal fusion framework based on statistical attention and contrastive attention for SLR. The two designed attention mechanisms investigate the intra-modal and inter-modal correlations of surface electromyography (sEMG) and inertial measurement unit (IMU) signals and fuse the two modalities. Statistical attention uses the Laplace operator and the lower quantile to select and enhance active features within each modal feature clip. Contrastive attention calculates the information gain of active features in a pair of enhanced feature clips located at the same position in the two modalities; the enhanced clips are then fused position-wise according to that gain. The fused multimodal features are fed into a Transformer-based network trained with connectionist temporal classification and cross-entropy losses for SLR. Experimental results show that SeeSign achieves an accuracy of 93.17% on isolated words and word error rates of 18.34% and 22.08% on one-handed and two-handed sign language datasets, respectively, outperforming state-of-the-art methods in both accuracy and robustness.
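The two attention steps summarized above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the authors' implementation: the exact Laplacian form, quantile level, information-gain definition, enhancement factor, and fusion rule are guesses consistent with the abstract, and the function names (`statistical_attention`, `contrastive_fusion`) are hypothetical.

```python
import numpy as np

def statistical_attention(clip, q=0.25):
    """Select and enhance 'active' features in one modality's feature clip.

    clip: (T, C) array -- T time steps, C feature channels.
    A discrete 1-D Laplacian along time estimates local feature activity;
    features whose activity reaches the lower quantile `q` are treated as
    active and amplified (here, simply doubled).
    """
    padded = np.pad(clip, ((1, 1), (0, 0)), mode="edge")
    lap = np.abs(padded[:-2] - 2.0 * padded[1:-1] + padded[2:])  # (T, C)
    thresh = np.quantile(lap, q)          # lower-quantile activity floor
    mask = lap >= thresh                  # active-feature mask
    enhanced = clip * (1.0 + mask.astype(clip.dtype))
    return enhanced, mask

def contrastive_fusion(clip_a, clip_b, eps=1e-8):
    """Fuse two same-position enhanced clips by an entropy-based 'gain'.

    A peaked (low-entropy) activity distribution is taken as more
    informative, so each modality is weighted by softmax(-entropy).
    """
    def entropy(x):
        p = np.abs(x).ravel()
        p = p / (p.sum() + eps)
        return -(p * np.log(p + eps)).sum()

    g = np.array([-entropy(clip_a), -entropy(clip_b)])
    w = np.exp(g - g.max())
    w = w / w.sum()                       # modality weights, sum to 1
    return w[0] * clip_a + w[1] * clip_b
```

In the paper the fused features then feed a Transformer-based recognizer; the sketch stops at the fusion step, which is where the two attention mechanisms operate.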
Pages: 1431-1443
Page count: 13