Cross Attentive Multi-Cue Fusion for Skeleton-Based Sign Language Recognition

Cited: 0
Authors
Ozdemir, Ogulcan [1 ]
Baytas, Inci M. [1 ]
Akarun, Lale [1 ]
Affiliations
[1] Bogazici Univ, Comp Engn Dept, TR-34342 Istanbul, Turkiye
Keywords
Hands; Topology; Sign language; Visualization; Systematic literature review; Joints; Videos; Training; Representation learning; Manuals; Sign language recognition; spatio-temporal representation learning; graph neural networks; multi-channel sequence modeling; cross attention; multi-cue fusion
DOI
10.1109/ACCESS.2025.3579092
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Sign language, the primary communication medium of the Deaf, uses visual cues from the upper body, hands, and face. Sign Language Recognition (SLR) aims to learn salient representations from these cues to bridge the communication gap between Deaf and hearing communities. Existing Graph Neural Network-based SLR frameworks often represent sign videos as sequences of graphs formed by hand and body joints. However, relying solely on upper-body topology often yields suboptimal solutions. This work shows that incorporating domain-specific hand topologies can, on its own, reach state-of-the-art SLR performance, which motivates fusing multiple visual cues to build robust and generalizable SLR frameworks. Yet the fusion process is challenged by differing spatial and temporal dynamics across articulators. To address this, we propose a multi-cue cross-attention framework that enables interactions between hand and upper-body cues during fusion. We demonstrate how the proposed attention-based framework exposes distinct temporal patterns in the visual-cue representations extracted by a Spatio-Temporal Graph Convolutional Network (ST-GCN) and exploits them to learn SL representations more effectively. Our experiments on two benchmark isolated sign language datasets, BosphorusSign22k and AUTSL, show that the proposed framework matches state-of-the-art performance on isolated SLR while highlighting the benefit of choosing domain-specific hand graph topologies and fusing multiple cues. Furthermore, our cross-attentive fusion of upper-body and hand cues improves recognition accuracy by around 1% and 3% on the respective datasets over hand-only models, while making recognition interpretable and demonstrating the complementary interactions between visual cues.
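The fusion mechanism the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation: it shows plain single-head cross-attention (no learned projections) between two per-frame cue representations, where the hand stream queries the upper-body stream and vice versa; the frame count, feature dimension, random stand-ins for the ST-GCN outputs, and the final concatenation are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Queries come from one cue; keys and values from the other cue."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # (T_q, T_kv) frame affinities
    weights = softmax(scores, axis=-1)             # each query row sums to 1
    return weights @ keys_values                   # (T_q, d) attended features

rng = np.random.default_rng(0)
T, d = 16, 64  # frames per clip and feature dim, chosen for illustration
hand_feats = rng.standard_normal((T, d))  # stand-in for ST-GCN hand-cue features
body_feats = rng.standard_normal((T, d))  # stand-in for ST-GCN body-cue features

# Each cue attends to the other; concatenating both views gives a fused
# multi-cue representation per frame.
hand_ctx = cross_attention(hand_feats, body_feats)
body_ctx = cross_attention(body_feats, hand_feats)
fused = np.concatenate([hand_ctx, body_ctx], axis=-1)  # shape (T, 2 * d)
```

The attention weights in `weights` are what make such a model inspectable: for every hand-cue frame they reveal which upper-body frames contributed to its fused representation.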
Pages: 106201-106217
Page count: 17