StepNet: Spatial-temporal Part-aware Network for Isolated Sign Language Recognition

Cited by: 7
Authors
Shen, Xiaolong [1 ]
Zheng, Zhedong [2 ]
Yang, Yi [1 ]
Affiliations
[1] Zhejiang Univ, Hangzhou 310013, Zhejiang, Peoples R China
[2] Univ Macau, Taipa 999078, Macao, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Sign language recognition; video analysis; model
DOI
10.1145/3656046
Chinese Library Classification (CLC): TP [Automation and Computer Technology]
Subject classification code: 0812
Abstract
The goal of sign language recognition (SLR) is to help people who are deaf or hard of hearing overcome the communication barrier. Most existing approaches fall into one of two lines, i.e., skeleton-based and RGB-based methods, but each line has its limitations: skeleton-based methods do not consider facial expressions, while RGB-based approaches usually ignore the fine-grained hand structure. To overcome both limitations, we propose a new RGB-part-based framework called the Spatial-temporal Part-aware network (StepNet). As its name suggests, it consists of two modules: Part-level Spatial Modeling and Part-level Temporal Modeling. Part-level Spatial Modeling automatically captures appearance-based properties, such as hands and faces, in the feature space without any keypoint-level annotations. Part-level Temporal Modeling, in turn, implicitly mines long short-term context to capture the relevant attributes over time. Extensive experiments demonstrate that, thanks to its spatial-temporal modules, StepNet achieves competitive Top-1 per-instance accuracy on three commonly used SLR benchmarks, i.e., 56.89% on WLASL, 77.2% on NMFs-CSL, and 77.1% on BOBSL. The proposed method is also compatible with optical flow input and yields superior performance when the two are fused. We hope that our work can serve as a preliminary step for people who are deaf or hard of hearing.
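To make the two-module design in the abstract concrete, the following is a minimal, hypothetical numpy sketch, not the authors' implementation: part-level spatial modeling is stood in for by pooling fixed horizontal stripes of each frame's feature map (a crude proxy for the learned part regions such as face and hands), and part-level temporal modeling is stood in for by fusing a short-term moving average with the long-term (whole-clip) mean. All function names, window sizes, and the stripe heuristic are assumptions for illustration only.

```python
import numpy as np

def part_spatial_pool(frame_feat, num_parts=3):
    """Hypothetical part-level spatial modeling: split the (H, W, C)
    feature map into horizontal stripes (a stand-in for learned part
    regions) and average-pool each stripe to a (num_parts, C) tensor."""
    stripes = np.array_split(frame_feat, num_parts, axis=0)
    return np.stack([s.mean(axis=(0, 1)) for s in stripes])

def part_temporal_model(part_feats, short_win=3):
    """Hypothetical part-level temporal modeling: fuse a short-term
    moving average over a small window with the long-term clip mean.
    part_feats has shape (T, num_parts, C); returns a (C,) descriptor."""
    T = part_feats.shape[0]
    long_term = part_feats.mean(axis=0, keepdims=True)        # (1, P, C)
    short_term = np.stack([
        part_feats[max(0, t - short_win + 1): t + 1].mean(axis=0)
        for t in range(T)
    ])                                                        # (T, P, C)
    return (short_term + long_term).mean(axis=(0, 1))         # (C,)

def stepnet_sketch(video, num_parts=3):
    """video: (T, H, W, C) frame features -> (C,) clip descriptor."""
    part_feats = np.stack([part_spatial_pool(f, num_parts) for f in video])
    return part_temporal_model(part_feats)
```

In the actual paper the part regions are captured in feature space rather than by fixed stripes, and the long short-term context is mined by learned modules; this sketch only shows how the spatial (per-frame, per-part) and temporal (across-frame) stages compose.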
Pages: 19