American Sign Language fingerspelling recognition in the wild with spatio-temporal feature extraction and multi-task learning

Cited by: 2
Authors
Pannattee, Peerawat [1]
Kumwilaisak, Wuttipong [1]
Hansakunbuntheung, Chatchawarn [2]
Thatphithakkul, Nattanun [2]
Kuo, C.-C. Jay [3]
Affiliations
[1] King Mongkuts Univ Technol Thonburi, Dept Elect & Telecommun Engn, Bangkok 10140, Thailand
[2] Natl Sci & Technol Dev Agcy, Assist Technol & Med Devices Res Ctr, Pathum Thani 12120, Thailand
[3] Univ Southern Calif, Ming Hsieh Dept Elect & Comp Engn, Los Angeles, CA 90007 USA
Keywords
Fingerspelling recognition; Variable-filter-length temporal-learning convolutional neural network; Multi-task learning; Supervised contrastive learning; Joint CTC/attention-based decoding
DOI
10.1016/j.eswa.2023.122901
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This study introduces a comprehensive approach to enhance the performance of fingerspelling recognition systems in dynamic environments. The methodology begins with spatial feature extraction using MobileNetV3Small, followed by transformation through a projection layer into a latent space. The Variable-Filter-Length Temporal-Learning Convolutional Neural Network (VTCNN) is then applied to extract both short-range and long-range temporal features, providing a robust representation of dynamic gestures. The recognition system incorporates a shared encoder for both the Connectionist Temporal Classification (CTC) decoder and the attention-based decoder, capitalizing on the unique strengths of each decoder. To address weak supervision challenges, a novel strategy involving supervised contrastive learning (SupCon) during retraining is proposed. Leveraging decoding results from the CTC decoder, an image set with frame labels is constructed, contributing to more efficient differentiation between fingerspelling gestures and improving overall accuracy. The final step involves a joint CTC/attention-based decoding strategy using the beam search algorithm. This approach effectively combines decoder outputs, resulting in superior recognition performance. The synergistic interplay of the proposed methods (VTCNN for temporal feature extraction, multi-task learning for leveraging decoder strengths, SupCon for feature clustering refinement, and joint decoding) culminates in a holistic, state-of-the-art fingerspelling recognition system, validated through benchmarking on the ChicagoFSWild and ChicagoFSWild+ datasets.
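The joint CTC/attention-based decoding mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it only rescores complete hypotheses with a weighted sum of a CTC log-probability (computed with the standard CTC forward algorithm) and an attention-decoder log-probability, whereas the paper applies the combination inside beam search. The function names, the `ctc_weight` value of 0.3, and the toy inputs are all assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """Return log(exp(a) + exp(b)) in a numerically stable way."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def ctc_log_prob(log_probs, labels, blank=0):
    """CTC forward algorithm: log P(labels | frames).

    log_probs[t][k] is the log posterior of symbol k at frame t.
    """
    # Interleave blanks around the labels: [a, b] -> [_, a, _, b, _]
    ext = [blank]
    for symbol in labels:
        ext += [symbol, blank]
    size = len(ext)

    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * size
        for s in range(size):
            total = alpha[s]                      # stay on the same state
            if s >= 1:
                total = log_add(total, alpha[s - 1])  # advance by one
            # Skipping a blank is allowed only between different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total = log_add(total, alpha[s - 2])
            if total != NEG_INF:
                new[s] = total + log_probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last label or on the trailing blank.
    result = alpha[size - 1]
    if size > 1:
        result = log_add(result, alpha[size - 2])
    return result


def joint_score(ctc_lp, att_lp, ctc_weight=0.3):
    """Weighted sum of the two log-probabilities (weight is an assumption)."""
    return ctc_weight * ctc_lp + (1.0 - ctc_weight) * att_lp


def rescore(log_probs, hypotheses, ctc_weight=0.3):
    """Pick the best hypothesis from a list of (labels, attention_log_prob)."""
    scored = [
        (joint_score(ctc_log_prob(log_probs, labels), att_lp, ctc_weight), labels)
        for labels, att_lp in hypotheses
    ]
    return max(scored)[1]


# Toy example: 2 frames, 2 symbols (0 = blank, 1 = a letter), uniform posteriors.
frames = [[math.log(0.5), math.log(0.5)] for _ in range(2)]
candidates = [([1], math.log(0.4)), ([], math.log(0.6))]
best = rescore(frames, candidates)
```

In this toy case the attention score alone prefers the empty hypothesis, while the CTC term shifts the decision toward the single-letter hypothesis, which is the kind of complementarity the joint decoding is meant to exploit.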
Pages: 17