Spatial-temporal transformer for end-to-end sign language recognition

Cited: 10
Authors
Cui, Zhenchao [1 ,2 ]
Zhang, Wenbo [1 ,2 ,3 ]
Li, Zhaoxin [3 ]
Wang, Zhaoqi [3 ]
Institutions
[1] Hebei Univ, Sch Cyber Secur & Comp, Baoding 071002, Hebei, Peoples R China
[2] Hebei Univ, Hebei Machine Vis Engn Res Ctr, Baoding 071002, Hebei, Peoples R China
[3] Chinese Acad Sci, Inst Comp Technol, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Spatial-temporal encoder; Continuous sign language recognition; Transformer; Patched image; ATTENTION;
DOI
10.1007/s40747-023-00977-w
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Continuous sign language recognition (CSLR) is essential for unrestricted communication between hearing-impaired and hearing people; it aims at aligning low-density video sequences with high-density text sequences. Current methods for CSLR are mainly based on convolutional neural networks. However, these methods balance spatial and temporal features poorly during visual feature extraction, making it difficult to improve recognition accuracy. To address this issue, we designed an end-to-end CSLR network: the Spatial-Temporal Transformer Network (STTN). The model encodes and decodes the sign language video into a predicted sequence that is aligned with a given text sequence. First, since the image sequences are too long for the model to handle directly, we chunk the sign language video frames ("image to patch"), which reduces the computational complexity. Second, global features of the sign language video are modeled at the beginning of the network, and the spatial action features of the current video frame and the semantic features of consecutive frames in the temporal dimension are extracted separately, allowing visual features to be fully extracted. Finally, the model uses a simple cross-entropy loss to align video and text. We extensively evaluated the proposed network on two publicly available datasets, CSL and RWTH-PHOENIX-Weather multi-signer 2014 (PHOENIX-2014); the results demonstrate the superior performance of our work on the CSLR task compared with state-of-the-art methods.
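The "image to patch" step in the abstract can be illustrated with a minimal NumPy sketch: each video frame is split into non-overlapping square patches that are flattened into tokens, so a transformer attends over patch tokens rather than raw pixels. The function name `frames_to_patches` and the patch size are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def frames_to_patches(frames, patch_size):
    """Split each frame of a (T, H, W, C) video into non-overlapping
    patch_size x patch_size patches and flatten each patch into a token.

    Returns an array of shape (T, num_patches, patch_size * patch_size * C).
    """
    t, h, w, c = frames.shape
    assert h % patch_size == 0 and w % patch_size == 0, "frame size must divide evenly"
    # (T, H/p, p, W/p, p, C): expose the patch grid
    patches = frames.reshape(t, h // patch_size, patch_size,
                             w // patch_size, patch_size, c)
    # (T, H/p, W/p, p, p, C): group the two grid axes together
    patches = patches.transpose(0, 1, 3, 2, 4, 5)
    # (T, num_patches, p*p*C): flatten grid into a token sequence per frame
    return patches.reshape(t, (h // patch_size) * (w // patch_size),
                           patch_size * patch_size * c)

# Toy example: 4 frames of an 8x8 RGB video, split into 4x4 patches.
video = np.random.rand(4, 8, 8, 3)
tokens = frames_to_patches(video, 4)
print(tokens.shape)  # (4, 4, 48): 4 frames, 4 patches each, 48-dim tokens
```

Working on 4 patch tokens per frame instead of 64 pixels keeps the attention cost quadratic in the (much shorter) token sequence, which is the complexity reduction the abstract refers to.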
Pages: 4645-4656
Page count: 12
Related Papers
50 records in total
  • [41] AN END-TO-END SPEECH ACCENT RECOGNITION METHOD BASED ON HYBRID CTC/ATTENTION TRANSFORMER ASR
    Gao, Qiang
    Wu, Haiwei
    Sun, Yanqing
    Duan, Yitao
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7253 - 7257
  • [42] Combining CNN and Transformer as Encoder to Improve End-to-End Handwritten Mathematical Expression Recognition Accuracy
    Zhang, Zhang
    Zhang, Yibo
    FRONTIERS IN HANDWRITING RECOGNITION, ICFHR 2022, 2022, 13639 : 185 - 197
  • [43] End-to-end automated speech recognition using a character based small scale transformer architecture
    Loubser, Alexander
    De Villiers, Pieter
    De Freitas, Allan
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 252
  • [44] End-to-end Image Compression with Swin-Transformer
    Wang, Meng
    Zhang, Kai
    Zhang, Li
    Li, Yue
    Li, Junru
    Wang, Yue
    Wang, Shiqi
    2022 IEEE INTERNATIONAL CONFERENCE ON VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP), 2022,
  • [45] A Novel End-to-End Transformer for Scene Graph Generation
    Ren, Chengkai
    Liu, Xiuhua
    Cao, Mengyuan
    Zhang, Jian
    Wang, Hongwei
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [46] End-to-end Neural Diarization: From Transformer to Conformer
    Liu, Yi Chieh
    Han, Eunjung
    Lee, Chul
    Stolcke, Andreas
    INTERSPEECH 2021, 2021, : 3081 - 3085
  • [47] Identification of Geochemical Anomalies Using an End-to-End Transformer
    Yu, Shuyan
    Deng, Hao
    Liu, Zhankun
    Chen, Jin
    Xiao, Keyan
    Mao, Xiancheng
    NATURAL RESOURCES RESEARCH, 2024, 33 (03) : 973 - 994
  • [48] Transformer Based End-to-End Mispronunciation Detection and Diagnosis
    Wu, Minglin
    Li, Kun
    Leung, Wai-Kim
    Meng, Helen
    INTERSPEECH 2021, 2021, : 3954 - 3958
  • [49] SDformer: Efficient End-to-End Transformer for Depth Completion
    Qian, Jian
    Sun, Miao
    Lee, Ashley
    Li, Jie
    Zhuo, Shenglong
    Chiang, Patrick Yin
    2022 INTERNATIONAL CONFERENCE ON INDUSTRIAL AUTOMATION, ROBOTICS AND CONTROL ENGINEERING, IARCE, 2022, : 56 - 61
  • [50] End-to-End Transformer for Compressed Video Quality Enhancement
    Yu, Li
    Chang, Wenshuai
    Wu, Shiyu
    Gabbouj, Moncef
    IEEE TRANSACTIONS ON BROADCASTING, 2024, 70 (01) : 197 - 207