Exploring discrete speech units for privacy-preserving and efficient speech recognition for school-aged and preschool children

Authors
Dutta, Satwik [1]
Irvin, Dwight [2]
Hansen, John H. L. [1]
Affiliations
[1] Univ Texas Dallas, Ctr Robust Speech Syst CRSS, Dallas, TX 75080 USA
[2] Univ Florida, Anita Zucker Ctr Excellence Early Childhood Studie, Gainesville, FL USA
Funding
National Science Foundation (USA);
Keywords
Automatic speech recognition; Discrete speech representation; Child speech processing; Speaker privacy; Early childhood; Educational technology; Preschool children; Developmental delay; SPEAKER;
DOI
10.1016/j.ijhcs.2025.103460
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Subject classification code
0812;
Abstract
Organizations across the world, including NATO, OECD, the WHO, and the United Nations, as well as many governments, are now employing guidelines for safe, secure, and trustworthy Artificial Intelligence (AI). While technology policies are still being formulated, many AI applications catered toward children have already been developed or are still under development. When designing any child-centered AI, it is of utmost importance to keep children's privacy at the forefront. One modality for child-centered AI is speech/language communication, which has found applications in various educational technologies, tutoring services, and interactive learning and social robots. Short of a full de-identification of speech segments, longer-duration sentences and audio content could reveal partial neutral identifying information (e.g., the gender of a child). When such content is taken in a longer-duration context with sequenced longitudinal data (e.g., audio recordings over full days at home or in classrooms, linked over time), privacy concerns grow and become critical. Motivated by a privacy-preserving design, this study explores the use of discrete speech units as a form of anonymous encoding to develop Automatic Speech Recognition (ASR) systems for children that better ensure privacy protection. The primary goal is to ascertain that discrete speech units retain the key linguistic information needed for the ASR task of output text creation, while lacking identifying speaker-specific information and the ability to re-generate the original speech waveform from the available sequence of discrete speech units. Here, a Discrete ASR model trained on the My Science Tutor Children's Conversational Speech Corpus (MyST) achieves an output word-error-rate (WER) of 15.7%.
Our Discrete ASR model achieves similar performance in terms of WER when compared to state-of-the-art End-to-End (E2E) ASR models trained using features extracted from a large-scale self-supervised pre-trained speech processing model (such as WavLM), although it is noted that the E2E ASR models are almost 10 times larger in model checkpoint memory size and number of model parameters and take 3x the amount of time to train. In addition, open-domain testing on other popular child speech corpora confirms that the proposed Discrete ASR models perform on par with E2E ASR models for corpora containing children's speech in the same age range as MyST (e.g., CMU corpus) and show slightly lower performance for a corpus containing a wider age range of children (e.g., OGI corpus). Finally, this study also shows that child ASR using the proposed discrete speech units achieves promising performance in recognizing WH-Words, Nouns, Verbs, and Pronouns in an early childhood case study of teacher-child interactions in a childcare facility, involving preschool children with and without speech/language delays, an extremely vulnerable and challenging speech/language assessment population.
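For readers unfamiliar with the word-error-rate (WER) metric reported in the abstract, the following is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the authors' scoring pipeline, which may apply text normalization not described in this record.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenization is a plain
    whitespace split, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a WER of 15.7% corresponds to roughly 16 word-level errors per 100 reference words.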
Pages: 16