Convolution neural network based automatic speech emotion recognition using Mel-frequency Cepstrum coefficients

Cited by: 1
Authors
Manju D. Pawar
Rajendra D. Kokate
Affiliations
[1] Maharashtra Institute of Technology
[2] Government College of Engineering
Source
Multimedia Tools and Applications | 2021, Vol. 80
Keywords
Convolution neural network; Feature extraction; Speech emotion recognition; Energy; Pitch
DOI
Not available
Abstract
Speech Emotion Recognition (SER) plays a significant role in affective computing and human-computer interaction, with a wide range of applications. In the literature, the most widely adopted approach to emotion recognition combines simple feature extraction with a simple classifier, and most such methods offer limited recognition efficiency. To address these drawbacks, this paper proposes five models based on a Convolution Neural Network (CNN) for recognising emotion from speech signals. In the proposed methodology, seven emotions (disgust, normal, fear, joy, anger, sadness, and surprise) are recognised using a CNN combined with feature extraction. Initially, the emotional speech signals are collected from a corpus such as the Berlin database. Feature extraction is then carried out using Pitch and Energy, Mel-Frequency Cepstral Coefficients (MFCC), and Mel Energy Spectrum Dynamic Coefficients (MEDC). These features are widely used for classifying speech data and perform well: Mel-cepstral coefficients model the spectral envelope compactly with adequate data and capture voice quality well. The extracted features are passed to the CNN for recognition. The proposed CNN contains one or more pairs of convolution and max-pooling layers, and recognises the emotion conveyed by the input speech signal. The proposed method is implemented in MATLAB and is compared against an existing method, Linear Prediction Cepstral Coefficients (LPCC) with a K-Nearest Neighbour (KNN) classifier, on test samples for performance evaluation.
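As a rough illustration of the MFCC pipeline the abstract describes (pre-emphasis, framing, power spectrum, mel filterbank, log, DCT), the following self-contained NumPy/SciPy sketch computes cepstral coefficients from a mono signal. It is a generic textbook-style MFCC implementation, not the paper's MATLAB code; all parameter values are illustrative defaults.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Illustrative MFCC extraction (not the paper's implementation)."""
    # Pre-emphasis boosts high frequencies.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Frame the signal and apply a Hamming window.
    n_frames = 1 + (len(sig) - n_fft) // hop
    frames = np.stack([sig[i * hop: i * hop + n_fft] for i in range(n_frames)])
    frames *= np.hamming(n_fft)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank spanning 0 .. sr/2.
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log mel energies, then DCT decorrelates them into cepstral coefficients.
    feat = np.log(power @ fbank.T + 1e-10)
    return dct(feat, type=2, axis=1, norm='ortho')[:, :n_ceps]

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s test tone
print(mfcc(sig).shape)  # (97, 13): one 13-coefficient vector per frame
```

The resulting frame-by-coefficient matrix is the kind of 2-D feature map a CNN with convolution and max-pooling pairs can consume directly.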
Statistical measures are used to analyse the performance: accuracy, precision, specificity, recall, sensitivity, error rate, the receiver operating characteristic (ROC) curve, the area under the curve (AUC), and the False Positive Rate (FPR).
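Most of the listed measures derive directly from a confusion matrix. A minimal one-vs-rest sketch (in Python rather than the paper's MATLAB; the toy labels are made up for illustration):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for one emotion class vs. the rest."""
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    total = tp + tn + fp + fn
    return {
        "accuracy":    (tp + tn) / total,
        "error_rate":  (fp + fn) / total,
        "precision":   tp / (tp + fp),
        "recall":      tp / (tp + fn),   # a.k.a. sensitivity
        "specificity": tn / (tn + fp),
        "fpr":         fp / (fp + tn),   # 1 - specificity
    }

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])  # toy ground truth
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # toy classifier output
m = binary_metrics(y_true, y_pred)
print(round(m["accuracy"], 3), round(m["fpr"], 3))  # 0.75 0.25
```

For a seven-emotion task these metrics are typically computed per class and averaged; sweeping a decision threshold over class scores yields the ROC curve, whose integral is the AUC.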
Pages: 15563-15587 (24 pages)