EXPLORING ATTENTION MECHANISMS IN INTEGRATION OF MULTI-MODAL INFORMATION FOR SIGN LANGUAGE RECOGNITION AND TRANSLATION

Cited by: 0

Authors:

Hakim, Zaber Ibn Abdul [1]

Swargo, Rasman Mubtasim [1]

Adnan, Muhammad Abdullah [1]

Affiliations:

[1] Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh

Source:

2024 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP | 2024

Keywords:

Sign Language; Multi-modal Learning; Transformers;

DOI:

10.1109/ICIP51287.2024.10648021

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Understanding the intricate, fast-paced movements of body parts is essential for recognizing and translating sign language. Adding information that identifies and locates the moving body parts has recently become an active research topic. However, previous work on multi-modal information raises concerns, such as sub-optimal methods for merging multi-modal features or models that are too computationally heavy. In our work, we address these issues with a plugin module based on cross-attention that lets each modality attend to the other. Moreover, we use 2-stage training to remove the dependency on separate feature extractors for the additional modalities in an end-to-end approach, easing the concern about computational complexity. In addition, the cross-attention plugin module is lightweight and adds no significant computational overhead to the original baseline. We evaluated our approaches on the RWTH-PHOENIX-2014 dataset for sign language recognition and on the RWTH-PHOENIX-2014T dataset for sign language translation. Our approach reduced the WER by 0.9 on the recognition task and increased the BLEU-4 score by 0.8 on the translation task.
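The abstract describes fusing modalities through cross-attention, where features from one stream attend to features from another. The following is a minimal illustrative sketch of generic scaled dot-product cross-attention with a residual fusion step, not the authors' plugin module. It assumes, purely for illustration, that the two streams are RGB frame features and pose features; the feature values, the `fuse` helper, and the omission of learned query/key/value projections are all simplifications introduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    queries: list of vectors from one modality (assumed: RGB features).
    keys, values: lists of vectors from the other modality (assumed: pose).
    Returns one attended vector per query, a weighted mix of `values`.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key in the other modality.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the other modality's value vectors.
        attended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(out_dim)
        ]
        outputs.append(attended)
    return outputs


def fuse(primary, auxiliary):
    """Hypothetical residual fusion: primary + attention over auxiliary."""
    attended = cross_attention(primary, auxiliary, auxiliary)
    return [
        [p + a for p, a in zip(p_vec, a_vec)]
        for p_vec, a_vec in zip(primary, attended)
    ]


# Made-up toy features: 3 "RGB" time steps, 2 "pose" time steps, dimension 4.
rgb = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
pose = [[2.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 1.0]]
fused = fuse(rgb, pose)
```

The output keeps the length and dimension of the primary stream, which is what would let a module of this kind be inserted into an existing pipeline without changing downstream shapes. A practical version would add learned projections, multiple heads, and normalization.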


Pages: 2529-2535

Page count: 7
