Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention

Cited by: 43
Authors
De Coster, Mathieu [1]
Van Herreweghe, Mieke [2]
Dambre, Joni [1]
Affiliations
[1] Univ Ghent, IMEC, IDLab AIRO, Technol Pk Zwijnaarde 126, Ghent, Belgium
[2] Univ Ghent, Blandijnberg 2, Ghent, Belgium
Source
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021 | 2021
Funding
Academy of Finland; EU Horizon 2020;
Keywords
LANGUAGE;
DOI
10.1109/CVPRW53098.2021.00383
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Automatic sign language recognition lies at the intersection of natural language processing (NLP) and computer vision. The highly successful transformer architectures, based on multi-head attention, originate from the field of NLP. The Video Transformer Network (VTN) is an adaptation of this concept for tasks that require video understanding, e.g., action recognition. However, due to the limited amount of labeled data that is commonly available for training automatic sign (language) recognition models, the VTN cannot reach its full potential in this domain. In this work, we reduce the impact of this data limitation by automatically pre-extracting useful information from the sign language videos. In our approach, different types of information are offered to a VTN in a multi-modal setup: per-frame human pose keypoints (extracted by OpenPose) to capture body movement, and hand crops to capture the (evolution of) hand shapes. We evaluate our method on the recently released AUTSL dataset for isolated sign recognition and obtain 92.92% accuracy on the test set using only RGB data. For comparison, the VTN architecture without hand crops and pose flow achieved 82% accuracy. A qualitative inspection of our model hints at further potential of multi-modal multi-head attention in a sign language recognition context.
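To make the described setup concrete, below is a minimal PyTorch sketch of a multi-modal transformer classifier in the spirit of the abstract: per-frame pose keypoints and left/right hand crops are embedded separately, fused into one token per frame, and fed to a transformer encoder whose self-attention operates over time. The module names, the keypoint count, the shared hand CNN, and all hyperparameters are illustrative assumptions rather than the authors' implementation; only the default of 226 classes reflects the AUTSL dataset.

import torch
import torch.nn as nn

class MultiModalSignTransformer(nn.Module):
    """Illustrative multi-modal, VTN-style sign classifier (not the authors' code)."""

    def __init__(self, num_keypoints=54, d_model=256, num_classes=226,
                 num_layers=4, num_heads=8):
        super().__init__()
        # Pose stream: (x, y) coordinates of body/hand keypoints for one frame.
        self.pose_embed = nn.Linear(num_keypoints * 2, d_model)
        # Hand stream: a small CNN that embeds one RGB hand crop (shared by both hands).
        self.hand_cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model),
        )
        # Fuse pose, left-hand, and right-hand embeddings into one token per frame.
        self.fuse = nn.Linear(3 * d_model, d_model)
        # Multi-head self-attention over the sequence of frame tokens.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, pose, left_hand, right_hand):
        # pose:       (B, T, num_keypoints, 2) keypoints, e.g. from OpenPose
        # left_hand:  (B, T, 3, H, W) RGB crops around the left hand
        # right_hand: (B, T, 3, H, W) RGB crops around the right hand
        B, T = pose.shape[:2]
        p = self.pose_embed(pose.flatten(2))                         # (B, T, d_model)
        lh = self.hand_cnn(left_hand.flatten(0, 1)).view(B, T, -1)   # (B, T, d_model)
        rh = self.hand_cnn(right_hand.flatten(0, 1)).view(B, T, -1)  # (B, T, d_model)
        tokens = self.fuse(torch.cat([p, lh, rh], dim=-1))           # one token per frame
        feats = self.temporal_encoder(tokens)                        # self-attention over time
        return self.classifier(feats.mean(dim=1))                    # pool over frames, classify

# Dummy forward pass: batch of 2 clips, 16 frames, 64x64 hand crops.
model = MultiModalSignTransformer()
logits = model(torch.randn(2, 16, 54, 2),
               torch.randn(2, 16, 3, 64, 64),
               torch.randn(2, 16, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 226])

The pose-flow input mentioned in the title (the temporal evolution of the keypoints) and all training details are omitted here; the sketch only shows one plausible way to combine the modalities before applying self-attention.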
Pages: 3436-3445
Number of pages: 10