Handling emotional speech: a prosody based data augmentation technique for improving neutral speech trained ASR systems

Cited by: 0
Authors
Pavan Raju Kammili
B. H. V. S. Ramakrishnam Raju
A. Sri Krishna
Affiliations
[1] Centurion University of Technology and Management, Department of Computer Science and Engineering
[2] SRKR Engineering College, Department of Information Technology
[3] Shri Vishnu Engineering College for Women, Department of Information Technology
Source
International Journal of Speech Technology | 2022, Vol. 25
Keywords
ASR; Prosody; Data augmentation;
DOI
Not available
Abstract
In this paper, the effect of emotional speech on the performance of neutral-speech-trained ASR systems is studied. Prosody-modification-based data augmentation is explored to compensate for the ASR performance degradation caused by emotional speech. The primary motive is to develop a Telugu ASR system that is least affected by these emotion-based intrinsic speaker-related acoustic variations. The two factors contributing to intrinsic speaker-related variability considered in this research are the fundamental frequency ($F_0$, or pitch) and the speaking rate. To simulate the ASR task, we trained our ASR system on neutral speech and tested it on data from both emotional and neutral speech. Compared with the performance metrics for neutral speech at the testing stage, the metrics for emotional speech are severely degraded. This degradation is attributed to the difference in the prosody and speaking-rate parameters of neutral and emotional speech. To overcome it, the prosody and speaking-rate parameters are varied and modified to create augmented versions of the training data. The original and augmented versions of the training data are pooled together and the system is re-trained in order to capture greater emotion-specific variation. For the Telugu ASR experiments, we used the Microsoft speech corpus for Indian languages (MSC-IL) for training on neutral speech and the Indian Institute of Technology Kharagpur Simulated Emotion Speech Corpus (IITKGP-SESC) for evaluating emotional speech. The basic emotions of anger, happiness and sadness are considered for evaluation along with neutral speech.
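The augmentation strategy described above can be sketched in a few lines: each neutral training utterance is copied at several modified speaking rates, and the copies are pooled with the original before re-training. This is only a minimal NumPy illustration under assumed parameters; plain interpolation shifts pitch along with duration, whereas the paper's prosody modification controls $F_0$ and speaking rate independently (typically via epoch-based methods such as PSOLA), so a real pipeline would substitute a proper prosody-modification tool.

```python
import numpy as np

def change_speaking_rate(signal, rate):
    """Resample the waveform on a stretched time axis.

    rate > 1 shortens the utterance (faster speech), rate < 1
    lengthens it. NOTE: simple interpolation also shifts pitch;
    it stands in here for a true speaking-rate modifier.
    """
    n_out = int(len(signal) / rate)
    t_out = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(t_out, np.arange(len(signal)), signal)

def augment(signal, rates=(0.9, 1.1)):
    """Return the original plus rate-modified copies,
    to be pooled into the training set (hypothetical factors)."""
    return [signal] + [change_speaking_rate(signal, r) for r in rates]
```

Pooling `augment(utterance)` over every neutral training utterance yields the enlarged training set that is then used to re-train the acoustic model.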
Pages: 197-204 (7 pages)