End-to-end emotional speech recognition using acoustic model adaptation based on knowledge distillation

Cited by: 0
Authors
Hong-In Yun
Jeong-Sik Park
Affiliations
[1] Hankuk University of Foreign Studies, Department of English Linguistics
[2] Hankuk University of Foreign Studies, Department of English Linguistics & Language Technology
Source
Multimedia Tools and Applications | 2023, Vol. 82
Keywords
Emotional speech recognition; Deep neural network; Model adaptation; Model compression; Knowledge distillation
DOI
Not available
Abstract
The end-to-end approach outperforms the traditional hidden Markov model-deep neural network (HMM-DNN) approach in speech recognition, but it still performs poorly on atypical speech, especially emotional speech. The ideal solution would be to build an acoustic model tailored to emotional speech recognition using only emotional speech data for each emotion, but this is impractical because collecting a sufficient amount of emotional speech data per emotion is difficult. In this study, we propose a method to improve emotional speech recognition performance using knowledge distillation, a technique originally introduced to reduce the computational cost of deep learning approaches by decreasing the number of model parameters. Beyond its conventional role in model compression, we employ this technique for model adaptation to emotional speech. The proposed method first builds a basic model (referred to as a teacher model) with a large number of parameters using a large amount of normal speech data, and then constructs a target model (referred to as a student model) with fewer parameters using a small amount of emotional speech data (i.e., adaptation data). Because the student model is trained on emotional speech data, it is expected to reflect the characteristics of each emotion well. In the emotional speech recognition experiments, the student model maintained its recognition performance regardless of the number of model parameters, whereas the teacher model's performance degraded significantly as the number of parameters decreased, with a word error rate degradation of about 10%. These results demonstrate that the student model serves as an acoustic model suitable for emotional speech recognition even though it does not require much emotional speech data.
Pages: 22759–22776 (17 pages)