Effective MLP and CNN based ensemble learning for speech emotion recognition

Cited by: 0
Authors
Middya A.I. [1]
Nag B. [2]
Roy S. [1]
Affiliations
[1] Department of Computer Science and Engineering, Jadavpur University, Kolkata
[2] Department of Mathematics, Jadavpur University, Kolkata
Keywords
Classification; Convolutional neural network; Deep learning; Speech emotion recognition
DOI
10.1007/s11042-024-19017-x
Abstract
Speech emotion recognition (SER) is one of the most important and active areas of research in speech processing. Numerous approaches have been proposed to address various limitations in this field, but the sheer diversity and complexity of speech emotions continue to make SER a challenging problem. This paper conducts a thorough investigation into speech emotion recognition in order to determine the most appropriate feature set and model for SER. A multi-layer perceptron (MLP) and convolutional neural network (CNN) based ensemble model is proposed: a simple yet powerful approach that substantially improves classification accuracy. The model's performance is evaluated on four benchmark datasets, namely RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song), EmoDB (Emotional Database), SAVEE (Surrey Audio-Visual Expressed Emotion), and TESS (Toronto Emotional Speech Set). The proposed model outperforms several baseline methods (decision tree (DT), random forest (RF), support vector machine (SVM), k-nearest neighbour (KNN), and the base learners, i.e., MLP and CNN) in terms of various performance metrics on all the datasets. Furthermore, it outperforms all previous works on the RAVDESS (Acc = 73.1%), SAVEE (Acc = 83.8%), and TESS (Acc = 99.9%) datasets. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
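For intuition only, the sketch below shows one way an MLP/CNN soft-voting ensemble for SER can be assembled. The feature choice (time-averaged MFCCs), layer sizes, helper names (extract_mfcc, build_mlp, build_cnn, ensemble_predict), and the simple averaging of softmax outputs are assumptions for illustration, not the exact architecture or fusion scheme reported in the paper.

```python
# Illustrative MLP + 1D-CNN soft-voting ensemble for SER.
# Feature extraction (40 MFCCs via librosa) and layer sizes are assumptions,
# not the configuration reported in the paper.
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 8   # e.g., the 8 RAVDESS emotion labels
N_MFCC = 40       # number of MFCC coefficients per clip

def extract_mfcc(path: str) -> np.ndarray:
    """Load a clip and return its time-averaged MFCC vector (shape: N_MFCC)."""
    signal, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=N_MFCC)
    return mfcc.mean(axis=1)

def build_mlp() -> tf.keras.Model:
    """Fully connected base learner on the MFCC vector."""
    return models.Sequential([
        layers.Input(shape=(N_MFCC,)),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(128, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

def build_cnn() -> tf.keras.Model:
    """1D convolutional base learner treating the MFCC vector as a sequence."""
    return models.Sequential([
        layers.Input(shape=(N_MFCC, 1)),
        layers.Conv1D(64, kernel_size=5, activation="relu", padding="same"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, kernel_size=5, activation="relu", padding="same"),
        layers.GlobalAveragePooling1D(),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

def ensemble_predict(mlp, cnn, x: np.ndarray) -> np.ndarray:
    """Soft voting: average the two softmax outputs and take the argmax."""
    p_mlp = mlp.predict(x, verbose=0)
    p_cnn = cnn.predict(x[..., np.newaxis], verbose=0)
    return np.argmax((p_mlp + p_cnn) / 2.0, axis=1)

if __name__ == "__main__":
    # Stand-in random data; replace with MFCCs extracted from a real corpus.
    x_train = np.random.rand(512, N_MFCC).astype("float32")
    y_train = np.random.randint(0, NUM_CLASSES, size=512)

    mlp, cnn = build_mlp(), build_cnn()
    for model, inputs in ((mlp, x_train), (cnn, x_train[..., np.newaxis])):
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(inputs, y_train, epochs=5, batch_size=32, verbose=0)

    print(ensemble_predict(mlp, cnn, x_train[:10]))
```

Soft voting is only one plausible fusion choice here; weighted averaging or stacking a meta-classifier on the base learners' outputs are common alternatives for this kind of ensemble.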
Pages: 83963-83990
Page count: 27