Modelling Sign Language with Encoder-Only Transformers and Human Pose Estimation Keypoint Data

Cited by: 2
Authors
Woods, Luke T. [1 ,2 ]
Rana, Zeeshan A. [3 ]
Affiliations
[1] Cranfield University, Digital Aviation Research & Technology Centre (DARTeC), Cranfield MK43 0AL, England
[2] Leidos Industrial Engineers Ltd, Unit 3, Bedford Link Logistics Park, Bell Farm Way, Kempston MK43 9SS, Bedfordshire, England
[3] Cranfield University, Centre for Aeronautics, School of Aerospace, Transport and Manufacturing (SATM), Cranfield MK43 0AL, Bedfordshire, England
Keywords
sign language recognition; human pose estimation; classification; computer vision; deep learning; machine learning; supervised learning; eye gaze; recognition; hand
DOI
10.3390/math11092129
Chinese Library Classification
O1 [Mathematics]
Subject Classification
0701; 070101
Abstract
We present a study on modelling American Sign Language (ASL) with encoder-only transformers and human pose estimation keypoint data. Using an enhanced version of the publicly available Word-level ASL (WLASL) dataset, and a novel normalisation technique based on signer body size, we show the impact model architecture has on accurately classifying sets of 10, 50, 100, and 300 isolated, dynamic signs using two-dimensional keypoint coordinates only. We demonstrate the importance of running and reporting results from repeated experiments to describe and evaluate model performance. We include descriptions of the algorithms used to normalise the data and generate the train, validation, and test data splits. We report top-1, top-5, and top-10 accuracy results, evaluated with two separate model checkpoint metrics based on validation accuracy and loss. We find models with fewer than 100k learnable parameters can achieve high accuracy on reduced vocabulary datasets, paving the way for lightweight consumer hardware to perform tasks that are traditionally resource-intensive, requiring expensive, high-end equipment. We achieve top-1, top-5, and top-10 accuracies of 97%, 100%, and 100%, respectively, on a vocabulary size of 10 signs; 87%, 97%, and 98% on 50 signs; 83%, 96%, and 97% on 100 signs; and 71%, 90%, and 94% on 300 signs, thereby setting a new benchmark for this task.
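The abstract describes normalising two-dimensional keypoint coordinates by signer body size so that classification does not depend on how large a signer appears in frame. The paper's exact algorithm is not reproduced here; the following is a minimal sketch of one plausible scheme, assuming keypoints shaped `(frames, joints, 2)` and hypothetical shoulder joint indices, in which each frame is centred on its mean keypoint and scaled by shoulder width as a body-size proxy.

```python
import numpy as np

# Hypothetical joint indices for the two shoulders (dataset-dependent).
L_SHOULDER, R_SHOULDER = 5, 6

def normalise_keypoints(keypoints: np.ndarray) -> np.ndarray:
    """Normalise 2D pose keypoints by an estimated signer body size.

    keypoints: array of shape (frames, joints, 2) holding (x, y) coordinates.
    Returns an array of the same shape, centred per frame and scaled so the
    shoulder width is 1, making signs comparable across signer sizes.
    """
    # Centre each frame on the mean keypoint position.
    centred = keypoints - keypoints.mean(axis=1, keepdims=True)
    # Per-frame shoulder width as a proxy for body size; shape (frames,).
    shoulder_width = np.linalg.norm(
        keypoints[:, L_SHOULDER] - keypoints[:, R_SHOULDER], axis=-1
    )
    # Guard against degenerate frames where the shoulders coincide.
    scale = np.maximum(shoulder_width, 1e-6)[:, None, None]
    return centred / scale
```

After this transform the inter-shoulder distance is exactly 1 in every frame, so the downstream transformer sees scale-invariant coordinates; the joint indices and the choice of shoulder width as the size reference are assumptions for illustration, not the paper's specification.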
Pages: 28