StepNet: Spatial-temporal Part-aware Network for Isolated Sign Language Recognition

Cited by: 7
Authors
Shen, Xiaolong [1 ]
Zheng, Zhedong [2 ]
Yang, Yi [1 ]
Affiliations
[1] Zhejiang Univ, Hangzhou 310013, Zhejiang, Peoples R China
[2] Univ Macau, Taipa 999078, Macao, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Sign language recognition; video analysis; model
DOI
10.1145/3656046
Chinese Library Classification (CLC): TP [Automation and Computer Technology]
Subject classification code: 0812
Abstract
The goal of sign language recognition (SLR) is to help people who are deaf or hard of hearing overcome the communication barrier. Most existing approaches fall into one of two lines, i.e., skeleton-based and RGB-based methods, but each line has its limitations: skeleton-based methods do not consider facial expressions, while RGB-based approaches usually ignore the fine-grained hand structure. To overcome both limitations, we propose a new RGB-part-based framework called the Spatial-temporal Part-aware network (StepNet). As its name suggests, it consists of two modules: Part-level Spatial Modeling and Part-level Temporal Modeling. Part-level Spatial Modeling automatically captures appearance-based properties, such as hands and faces, in the feature space without any keypoint-level annotations. Part-level Temporal Modeling, in turn, implicitly mines long short-term context to capture the relevant attributes over time. Extensive experiments demonstrate that, thanks to its spatial-temporal modules, StepNet achieves competitive Top-1 per-instance accuracy on three commonly used SLR benchmarks, i.e., 56.89% on WLASL, 77.2% on NMFs-CSL, and 77.1% on BOBSL. The proposed method is also compatible with optical flow input and yields superior performance when the two are fused. We hope that our work can serve as a preliminary step for people who are deaf or hard of hearing.
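To make the two-module design in the abstract concrete, the following is a minimal, hypothetical numpy sketch, not the authors' implementation: part-level spatial modeling is stood in for by pooling fixed horizontal stripes of each frame's feature map (a crude proxy for the learned part regions such as face and hands), and part-level temporal modeling is stood in for by fusing a short-term moving average with the long-term (whole-clip) mean. All function names, window sizes, and the stripe heuristic are assumptions for illustration only.

```python
import numpy as np

def part_spatial_pool(frame_feat, num_parts=3):
    """Hypothetical part-level spatial modeling: split the (H, W, C)
    feature map into horizontal stripes (a stand-in for learned part
    regions) and average-pool each stripe to a (num_parts, C) tensor."""
    stripes = np.array_split(frame_feat, num_parts, axis=0)
    return np.stack([s.mean(axis=(0, 1)) for s in stripes])

def part_temporal_model(part_feats, short_win=3):
    """Hypothetical part-level temporal modeling: fuse a short-term
    moving average over a small window with the long-term clip mean.
    part_feats has shape (T, num_parts, C); returns a (C,) descriptor."""
    T = part_feats.shape[0]
    long_term = part_feats.mean(axis=0, keepdims=True)        # (1, P, C)
    short_term = np.stack([
        part_feats[max(0, t - short_win + 1): t + 1].mean(axis=0)
        for t in range(T)
    ])                                                        # (T, P, C)
    return (short_term + long_term).mean(axis=(0, 1))         # (C,)

def stepnet_sketch(video, num_parts=3):
    """video: (T, H, W, C) frame features -> (C,) clip descriptor."""
    part_feats = np.stack([part_spatial_pool(f, num_parts) for f in video])
    return part_temporal_model(part_feats)
```

In the actual paper the part regions are captured in feature space rather than by fixed stripes, and the long short-term context is mined by learned modules; this sketch only shows how the spatial (per-frame, per-part) and temporal (across-frame) stages compose.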
Pages: 19