Spatial-temporal transformer for end-to-end sign language recognition

Cited by: 10
Authors
Cui, Zhenchao [1, 2]
Zhang, Wenbo [1, 2, 3]
Li, Zhaoxin [3]
Wang, Zhaoqi [3]
Affiliations
[1] Hebei Univ, Sch Cyber Secur & Comp, Baoding 071002, Hebei, Peoples R China
[2] Hebei Univ, Hebei Machine Vis Engn Res Ctr, Baoding 071002, Hebei, Peoples R China
[3] Chinese Acad Sci, Inst Comp Technol, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Spatial-temporal encoder; Continuous sign language recognition; Transformer; Patched image; Attention
DOI
10.1007/s40747-023-00977-w
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Continuous sign language recognition (CSLR) is an essential task for communication between hearing-impaired people and hearing people. It aims to align low-density video sequences with high-density text sequences. Current CSLR methods are mainly based on convolutional neural networks. However, these methods balance spatial and temporal features poorly during visual feature extraction, which makes it difficult to improve recognition accuracy. To address this issue, we designed an end-to-end CSLR network, the Spatial-Temporal Transformer Network (STTN). The model encodes and decodes the sign language video into a predicted sequence that is aligned with a given text sequence. First, since the image sequences are too long for the model to handle directly, we split the sign language video frames into chunks ("image to patch"), which reduces the computational complexity. Second, global features of the sign language video are modeled at the beginning of the model, and the spatial action features of the current video frame and the semantic features of consecutive frames in the temporal dimension are extracted separately, so that visual features are fully extracted. Finally, the model uses a simple cross-entropy loss to align video and text. We extensively evaluated the proposed network on two publicly available datasets, CSL and RWTH-PHOENIX-Weather multi-signer 2014 (PHOENIX-2014). The results demonstrate the superior performance of our work on the CSLR task compared with state-of-the-art methods.
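The "image to patch" step in the abstract can be illustrated with a minimal sketch. It assumes non-overlapping square patches and a channels-last frame layout; the abstract states neither the patch size nor the layout, so the function name `frames_to_patches`, the 224x224 frame size, and the 16-pixel patch size are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def frames_to_patches(frames, patch):
    """Split frames of shape (T, H, W, C) into flattened, non-overlapping
    patches of shape (T, N, patch * patch * C), where N = (H/patch) * (W/patch).

    Hypothetical helper: the paper's actual patching scheme may differ.
    """
    t, h, w, c = frames.shape
    if h % patch or w % patch:
        raise ValueError("frame height and width must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    # Separate each spatial axis into (grid index, offset inside the patch).
    x = frames.reshape(t, gh, patch, gw, patch, c)
    # Bring the two grid axes together so each patch is contiguous.
    x = x.transpose(0, 1, 3, 2, 4, 5)  # (T, gh, gw, patch, patch, C)
    return x.reshape(t, gh * gw, patch * patch * c)


# Assumed example sizes: 8 frames of 224x224 RGB, 16x16 patches.
video = np.zeros((8, 224, 224, 3), dtype=np.float32)
patches = frames_to_patches(video, 16)
print(patches.shape)  # (8, 196, 768)
```

Under these assumed sizes, each frame becomes 196 tokens rather than 50,176 individual pixel positions, which is the kind of sequence-length reduction the abstract credits with lowering computational cost.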
Pages: 4645-4656
Number of pages: 12