Static hand gesture recognition method based on the Vision Transformer

Cited by: 4
Authors
Zhang, Yu [1 ]
Wang, Junlin [1 ]
Wang, Xin [1 ]
Jing, Haonan [1 ]
Sun, Zhanshuo [1 ]
Cai, Yu [1 ]
Affiliations
[1] Inner Mongolia Univ, Coll Elect Informat Engn, Hohhot 010021, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Hand gesture recognition; Vision Transformer; Arm removal; Data augmentation; CONVOLUTIONAL NEURAL-NETWORKS; AMERICAN SIGN-LANGUAGE;
DOI
10.1007/s11042-023-14732-3
Chinese Library Classification (CLC)
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Hand gesture recognition (HGR) is a key component of human-computer interaction (HCI). Static hand gesture recognition amounts to classifying hand gesture images, a task currently dominated by Convolutional Neural Network (CNN) methods. The Vision Transformer (ViT) architecture dispenses with convolutional layers entirely and instead uses a multi-head attention mechanism to learn global information. This paper therefore proposes a static hand gesture recognition method based on the Vision Transformer. A self-made dataset and two publicly available American Sign Language (ASL) datasets are used to train and evaluate the ViT architecture. For the self-made dataset, hand gesture images are captured with a Microsoft Kinect camera, whose depth information is used to filter out the background; an eight-connected discrimination algorithm and a distance transformation algorithm then remove the redundant arm information. The paper also studies the impact of several data augmentation strategies on recognition performance, using accuracy, F1 score, recall, and precision as evaluation metrics. The proposed model achieves validation accuracies of 99.44%, 99.37%, and 96.53% on the three datasets, outperforming the results obtained with several CNN structures.
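The multi-head attention mechanism that the abstract credits with learning global information can be sketched in plain NumPy. This is a toy illustration with random, untrained weights, not the authors' model: all names (`multi_head_attention`, `split_heads`) and the shapes chosen here are our assumptions, and a real ViT would add patch embedding, positional encodings, residual connections, and learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads, rng):
    """Toy multi-head self-attention over a patch sequence.

    x: (seq_len, d_model) array, e.g. embeddings of flattened image patches.
    The random projection matrices stand in for learned ViT parameters.
    """
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))

    def split_heads(h):
        # (seq, d_model) -> (heads, seq, d_head)
        return h.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)
    # Scaled dot-product attention: every patch attends to every other patch,
    # which is how ViT mixes global information without any convolutions.
    scores = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head), axis=-1)
    out = (scores @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo

# 16 hypothetical "patches" with 32-dimensional embeddings, 4 attention heads.
rng = np.random.default_rng(0)
patches = rng.standard_normal((16, 32))
attended = multi_head_attention(patches, num_heads=4, rng=rng)
print(attended.shape)  # sequence length and width are preserved
```

Because each attention score row spans the whole patch sequence, a single layer already relates distant image regions, whereas a convolution sees only its local receptive field.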
Pages: 31309-31328
Number of pages: 20