Exploring discrete speech units for privacy-preserving and efficient speech recognition for school-aged and preschool children

Cited by: 0

Authors:

Dutta, Satwik [1]

Irvin, Dwight [2]

Hansen, John H. L. [1]

Affiliations:

[1] Univ Texas Dallas, Ctr Robust Speech Syst CRSS, Dallas, TX 75080 USA

[2] Univ Florida, Anita Zucker Ctr Excellence Early Childhood Studie, Gainesville, FL USA

Source:

INTERNATIONAL JOURNAL OF HUMAN-COMPUTER STUDIES | 2025 / Vol. 199

Funding:

National Science Foundation (USA);

Keywords:

Automatic speech recognition; Discrete speech representation; Child speech processing; Speaker privacy; Early childhood; Educational technology; Preschool children; Developmental delay; SPEAKER;

DOI:

10.1016/j.ijhcs.2025.103460

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Organizations across the world, including NATO, the OECD, the WHO, and the United Nations, as well as many governments, are now employing guidelines for safe, secure, and trustworthy Artificial Intelligence (AI). While technology policies are still being formulated, many AI applications aimed at children have already been developed or are under development. When designing any child-centered AI, it is of utmost importance to keep children's privacy at the forefront. One modality for child-centered AI is speech/language communication, which has found applications in various educational technologies, tutoring services, as well as interactive learning and social robots. Short of full de-identification of speech segments, longer-duration sentences and audio content could reveal partial, neutral identifying information (e.g., the gender of a child); when taken in a longer-duration context with sequenced longitudinal data (e.g., audio recordings over full days at home or in classrooms, linked over time), privacy concerns grow and become critical. Motivated by a privacy-preserving design, this study explores the use of discrete speech units as a form of anonymous encoding to develop Automatic Speech Recognition (ASR) systems for children that better ensure privacy protection. The primary goal here is to ascertain that discrete speech units retain the key linguistic information for the ASR task of output text creation, but simultaneously lack identifying speaker-specific information and do not allow the original speech waveform to be regenerated from the available sequence of discrete speech units. Here, a Discrete ASR model trained on the My Science Tutor Children's Conversational Speech Corpus (MyST) achieves an output word error rate (WER) of 15.7%.
Our Discrete ASR model achieves similar WER to state-of-the-art End-to-End (E2E) ASR models trained using features extracted from large-scale self-supervised pre-trained speech processing models (such as WavLM), although the E2E ASR models are almost 10 times larger in model checkpoint memory size and number of model parameters and take 3 times as long to train. In addition, open-domain testing on other popular child speech corpora confirms that the proposed Discrete ASR models perform on par with E2E ASR models for corpora containing children's speech in the same age range as MyST (e.g., the CMU corpus) and slightly worse for a corpus containing a wider age range of children (e.g., the OGI corpus). Finally, this study also shows that child ASR using the proposed discrete speech units achieves promising performance in recognizing WH-Words, Nouns, Verbs, and Pronouns in an early childhood case study of teacher-child interactions in a childcare facility, involving preschool children with and without speech/language delays, an extremely vulnerable and challenging population for speech/language assessment.


Pages: 16
