Benefits of pre-trained mono- and cross-lingual speech representations for spoken language understanding of Dutch dysarthric speech

Cited: 0
Authors
Pu Wang
Hugo Van hamme
Institution
KU Leuven, Department of Electrical Engineering
Source
EURASIP Journal on Audio, Speech, and Music Processing | Volume 2023
Keywords
Spoken language understanding; Low-resource dysarthric speech; Pre-training; Self-supervised learning; Transformers; Time-delay neural network; Whisper; XLSR-53; Wav2Vec2; Impairment intelligibility
DOI: not available
Abstract
With the rise of deep learning, spoken language understanding (SLU) for command-and-control applications, such as a voice-controlled virtual assistant, can offer reliable hands-free operation to physically disabled individuals. However, due to data scarcity, processing dysarthric speech remains a challenge. Pre-training (part of) the SLU model with supervised automatic speech recognition (ASR) targets or with self-supervised learning (SSL) may help to overcome the lack of data, but no research has shown which pre-training strategy performs better for SLU on dysarthric speech, nor to what extent the SLU task benefits from knowledge transfer from pre-training on dysarthric acoustic tasks. This work compares different mono- and cross-lingual pre-training methodologies (supervised and unsupervised) and quantitatively investigates the benefits of pre-training for SLU tasks on Dutch dysarthric speech. The designed SLU systems consist of a pre-trained speech representation encoder and an SLU decoder that maps encoded features to intents. Four types of pre-trained encoders are evaluated: a mono-lingual time-delay neural network (TDNN) acoustic model, a mono-lingual transformer acoustic model, a cross-lingual transformer acoustic model (Whisper), and a cross-lingual SSL Wav2Vec2.0 model (XLSR-53). These are complemented with three types of SLU decoders: non-negative matrix factorization (NMF), capsule networks, and long short-term memory (LSTM) networks. The acoustic performance of the four pre-trained encoders is tested on Dutch dysarthric home-automation data, with word error rate (WER) results used to investigate the correlation between the dysarthric acoustic task (ASR) and the semantic task (SLU). By introducing the intelligibility score (IS) as a metric of impairment severity, this paper further quantitatively analyzes dysarthria-severity-dependent models for SLU tasks.
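The encoder-decoder design described above can be illustrated with a minimal PyTorch sketch of the LSTM variant: a frozen pre-trained encoder produces frame-level features, and a small LSTM decoder pools them over time and classifies the intent. The feature dimension (1024, as in XLSR-53 large), the hidden size, and the number of intents (27) are illustrative assumptions, not values confirmed by the paper.

```python
import torch
import torch.nn as nn

class IntentDecoder(nn.Module):
    # Bidirectional LSTM decoder: frame-level speech features -> intent logits.
    # Dimensions here are illustrative assumptions, not the paper's settings.
    def __init__(self, feat_dim=1024, hidden_dim=128, n_intents=27):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, n_intents)

    def forward(self, feats):
        out, _ = self.lstm(feats)       # (batch, frames, 2 * hidden_dim)
        pooled = out.mean(dim=1)        # average-pool over time frames
        return self.classifier(pooled)  # (batch, n_intents) intent logits

# Stand-in for the output of a frozen pre-trained encoder
# (e.g. XLSR-53 features): 2 utterances, 50 frames, 1024-dim features.
feats = torch.randn(2, 50, 1024)
logits = IntentDecoder()(feats)
print(logits.shape)  # torch.Size([2, 27])
```

In a real system, `feats` would come from one of the four pre-trained encoders compared in the paper, kept frozen or fine-tuned, while only the lightweight decoder is trained on the small dysarthric SLU dataset.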