Locality-Aware Transformer for Video-Based Sign Language Translation

Cited by: 7
Authors
Guo, Zihui [1]
Hou, Yonghong [1]
Hou, Chunping [1]
Yin, Wenjie [1]
Affiliations
[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
Keywords
Videos; Assistive technologies; Gesture recognition; Transformers; Encoding; Visualization; Task analysis; Multi-stride position encoding; adaptive temporal interaction; gloss counting task; sign language translation; RECOGNITION;
DOI
10.1109/LSP.2023.3263808
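Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification code
0808; 0809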
Abstract
Recently, the application of transformers has brought significant progress in sign language translation. However, existing transformer-based methods neglect several characteristics of sign videos, which hinders translation performance. First, in sign videos, multiple consecutive frames represent a single sign gloss, so local temporal relations are crucial. Second, the inconsistency between video and text demands that the model be capable of non-local and global context modeling. To address these issues, a locality-aware transformer is proposed for sign language translation. Concretely, the multi-stride position encoding scheme assigns the same position index to adjacent frames with various strides to enhance the local dependency. Afterward, the adaptive temporal interaction module is utilized to capture non-local and flexible local frame correlations simultaneously. Moreover, a gloss counting task is designed to facilitate the holistic understanding of sign videos. Experimental results on two benchmark datasets demonstrate the effectiveness of the proposed framework.
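The abstract's description of multi-stride position encoding (adjacent frames sharing one position index, at several strides) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `multi_stride_indices`, the choice of strides (1, 2, 4), and the use of floor division to group frames are all assumptions made for illustration. The abstract does not say how the indices are turned into embeddings or how the strides are combined.

```python
def multi_stride_indices(num_frames, strides=(1, 2, 4)):
    """Map each frame to a position index for every stride.

    Assumption (not stated in the abstract): with stride s, frame t
    receives index t // s, so every run of s adjacent frames shares
    the same position index.
    """
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    if any(s < 1 for s in strides):
        raise ValueError("every stride must be a positive integer")
    # One list of indices per stride, keyed by the stride value.
    return {s: [t // s for t in range(num_frames)] for s in strides}


if __name__ == "__main__":
    for stride, idx in multi_stride_indices(8).items():
        print(f"stride {stride}: {idx}")
    # stride 1: [0, 1, 2, 3, 4, 5, 6, 7]
    # stride 2: [0, 0, 1, 1, 2, 2, 3, 3]
    # stride 4: [0, 0, 0, 0, 1, 1, 1, 1]
```

With a stride of 1 this reduces to ordinary per-frame position indices. Larger strides make neighbouring frames indistinguishable by position, which is one plausible way to encode the idea that several consecutive frames belong to the same sign gloss.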
Pages: 364-368
Number of pages: 5