Static hand gesture recognition method based on the Vision Transformer

Cited by: 4
Authors
Zhang, Yu [1 ]
Wang, Junlin [1 ]
Wang, Xin [1 ]
Jing, Haonan [1 ]
Sun, Zhanshuo [1 ]
Cai, Yu [1 ]
Affiliations
[1] Inner Mongolia Univ, Coll Elect Informat Engn, Hohhot 010021, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Hand gesture recognition; Vision Transformer; Arm removal; Data augmentation; CONVOLUTIONAL NEURAL-NETWORKS; AMERICAN SIGN-LANGUAGE;
DOI
10.1007/s11042-023-14732-3
CLC number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Hand gesture recognition (HGR) is an essential part of human-computer interaction (HCI), and static hand gesture recognition amounts to classifying hand gesture images. At present, this classification relies mainly on Convolutional Neural Network (CNN) methods. The Vision Transformer (ViT) architecture dispenses with convolutional layers entirely and instead uses a multi-head attention mechanism to learn global information. This paper therefore proposes a static hand gesture recognition method based on the Vision Transformer. A self-made dataset and two publicly available American Sign Language (ASL) datasets are used to train and evaluate the ViT architecture. The depth information provided by a Microsoft Kinect camera is used to capture the hand gesture images and filter out the background, after which an eight-connected-component discrimination algorithm and a distance transformation algorithm remove the redundant arm region; the resulting images constitute the self-made dataset. The paper also studies the impact of several data augmentation strategies on recognition performance, using accuracy, F1 score, recall, and precision as evaluation metrics. The proposed model achieves validation accuracies of 99.44%, 99.37%, and 96.53% on the three datasets, respectively, outperforming several CNN architectures.
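The arm-removal preprocessing described in the abstract can be illustrated with a short sketch. The Python code below is not the authors' implementation; it is one plausible reading of the pipeline using OpenCV, and the depth band, the choice of keeping the largest 8-connected component, and the crop size are illustrative assumptions. It thresholds the Kinect depth map to isolate the foreground, keeps the largest 8-connected blob, takes the maximum of the distance transform as an approximate palm centre, and crops a window around that centre so most of the forearm falls outside the image.

# Illustrative sketch (not the authors' code): segment the hand from a Kinect
# depth map, keep the largest 8-connected component, find the palm centre via
# a distance transform, and crop away most of the forearm. The depth band and
# crop half-size are placeholder assumptions.
import cv2
import numpy as np

def extract_hand(depth_mm: np.ndarray,
                 near: int = 500, far: int = 900,  # assumed hand depth band (mm)
                 half: int = 112) -> np.ndarray:
    # 1) Depth thresholding: keep pixels inside the assumed depth band.
    mask = ((depth_mm > near) & (depth_mm < far)).astype(np.uint8) * 255

    # 2) 8-connected component analysis: keep only the largest foreground blob.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    if n < 2:
        raise ValueError("no foreground object found in the depth band")
    largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))  # skip background label 0
    blob = np.where(labels == largest, 255, 0).astype(np.uint8)

    # 3) Distance transform: the interior pixel farthest from the blob contour
    #    is a reasonable approximation of the palm centre.
    dist = cv2.distanceTransform(blob, cv2.DIST_L2, 5)
    _, _, _, (cx, cy) = cv2.minMaxLoc(dist)

    # 4) Crop a fixed window around the palm centre so that the forearm,
    #    which extends away from the palm, is largely discarded.
    h, w = blob.shape
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    return blob[y0:y1, x0:x1]

In the full method, crops like this would presumably be resized to the ViT input resolution (224 x 224 with 16 x 16 patches in the standard ViT-B/16 configuration) before training; the exact input size used by the authors is not stated here.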
Pages: 31309-31328
Number of pages: 20