Spatial-temporal transformer for end-to-end sign language recognition

Cited by: 10

Authors:

Cui, Zhenchao [1,2]

Zhang, Wenbo [1,2,3]

Li, Zhaoxin [3]

Wang, Zhaoqi [3]

Affiliations:

[1] Hebei Univ, Sch Cyber Secur & Comp, Baoding 071002, Hebei, Peoples R China

[2] Hebei Univ, Hebei Machine Vis Engn Res Ctr, Baoding 071002, Hebei, Peoples R China

[3] Chinese Acad Sci, Inst Comp Technol, Beijing 100190, Peoples R China

Source:

COMPLEX & INTELLIGENT SYSTEMS | 2023, Vol. 9, No. 4

Funding:

National Natural Science Foundation of China;

Keywords:

Spatial-temporal encoder; Continuous sign language recognition; Transformer; Patched image; ATTENTION;

DOI:

10.1007/s40747-023-00977-w

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Continuous sign language recognition (CSLR) is an essential task for communication between hearing-impaired people and people without hearing impairment; it aims to align low-density video sequences with high-density text sequences. Current CSLR methods are mainly based on convolutional neural networks. However, these methods balance spatial and temporal features poorly during visual feature extraction, which makes it difficult to improve recognition accuracy. To address this issue, we designed an end-to-end CSLR network, the Spatial-Temporal Transformer Network (STTN). The model encodes and decodes the sign language video as a predicted sequence that is aligned with a given text sequence. First, since the image sequences are too long for the model to handle directly, we chunk the sign language video frames into patches ("image to patch"), which reduces the computational complexity. Second, global features of the sign language video are modeled at the beginning of the model, and the spatial action features of the current video frame and the semantic features of consecutive frames in the temporal dimension are extracted separately, so that visual features are fully extracted. Finally, the model uses a simple cross-entropy loss to align video and text. We extensively evaluated the proposed network on two publicly available datasets, CSL and RWTH-PHOENIX-Weather multi-signer 2014 (PHOENIX-2014); the results demonstrate the superior performance of our work on the CSLR task compared with state-of-the-art methods.
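The "image to patch" step mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `frames_to_patches`, the 224x224 frame size, and the 16-pixel patch size are assumptions chosen only for illustration, and the record does not state the values used in the paper.

```python
import numpy as np

def frames_to_patches(video, patch_size):
    """Split every frame of a (T, H, W, C) video into non-overlapping patches.

    Returns an array of shape (T, N, patch_size * patch_size * C),
    where N = (H // patch_size) * (W // patch_size).
    Hypothetical helper for illustration only.
    """
    t, h, w, c = video.shape
    p = patch_size
    if h % p or w % p:
        raise ValueError("frame height and width must be divisible by patch_size")
    # (T, H/p, p, W/p, p, C) -> (T, H/p, W/p, p, p, C)
    grid = video.reshape(t, h // p, p, w // p, p, c).transpose(0, 1, 3, 2, 4, 5)
    # Flatten each patch into one vector, and the patch grid into one axis.
    return grid.reshape(t, (h // p) * (w // p), p * p * c)

# Assumed example sizes: 8 frames of 224x224 RGB, 16x16 patches.
video = np.zeros((8, 224, 224, 3), dtype=np.float32)
patches = frames_to_patches(video, 16)
print(patches.shape)  # (8, 196, 768)
```

Under these assumed sizes, each frame becomes a sequence of 196 patch vectors rather than 50,176 individual pixel positions, which is the sense in which patching shortens the sequence a transformer has to attend over.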

Pages: 4645-4656

Page count: 12
