Lightweight Deep Learning Framework for Speech Emotion Recognition

Cited by: 8
Authors
Akinpelu, Samson [1 ]
Viriri, Serestina [1 ]
Adegun, Adekanmi [1 ]
Affiliations
[1] Univ KwaZulu Natal, Sch Math Stat & Comp Sci, ZA-4041 Durban, South Africa
Keywords
Deep learning; convolutional neural network; speech emotion; lightweight; human-computer interaction; neural networks; recurrent; features
DOI
10.1109/ACCESS.2023.3297269
Chinese Library Classification
TP [Automation and computer technology]
Discipline Code
0812
Abstract
A Speech Emotion Recognition (SER) system, which analyzes human utterances to determine a speaker's emotion, has a growing impact on how people and machines interact. Recent growth in human-computer interaction and computational intelligence has drawn the attention of many Artificial Intelligence (AI) researchers to deep learning because of its wide applicability across fields including computer vision, natural language processing, and affective computing. Deep learning models do not need manually engineered features because they can automatically extract discriminative features from the input data. However, deep learning models call for substantial resources, high processing power, and hyper-parameter tuning, making them unsuitable for lightweight devices. In this study, we focus on developing an efficient lightweight model for speech emotion recognition with optimized parameters and without compromising performance. Our proposed model integrates Random Forest and Multilayer Perceptron (MLP) classifiers into the VGGNet framework for efficient speech emotion recognition. The proposed model was evaluated against other deep learning-based methods (InceptionV3, ResNet, MobileNetV2, DenseNet) and yielded low computational complexity with optimum performance. The experiments were carried out on three datasets, TESS, EMODB, and RAVDESS, from which Mel Frequency Cepstral Coefficient (MFCC) features were extracted for six to eight emotion classes, namely Sad, Angry, Happy, Surprise, Neutral, Disgust, Fear, and Calm. Our model achieved accuracies of 100%, 96%, and 86.25% on the TESS, EMODB, and RAVDESS datasets, respectively, showing that the proposed lightweight model attains higher recognition accuracy than recent state-of-the-art models in the literature.
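The front end of the pipeline described in the abstract is MFCC feature extraction. As a rough illustration of what that step computes (framing, windowing, power spectrum, mel filterbank, log compression, DCT), here is a minimal NumPy-only sketch; the frame size, hop length, and filter counts below are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_mfcc=13):
    """MFCCs: frame -> Hamming window -> power spectrum -> mel filterbank -> log -> DCT-II."""
    # Slice the signal into overlapping frames and apply a Hamming window
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hamming(n_fft)
    # Per-frame power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank with edges spaced linearly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising slope
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling slope
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log-mel energies; keep the first n_mfcc coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    return logmel @ dct.T  # shape: (n_frames, n_mfcc)

# Example: one second of a synthetic 440 Hz tone in place of a speech clip
sr = 16000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 440.0 * t)
feats = mfcc(sig, sr)
print(feats.shape)
```

In a full SER pipeline such features (or their per-clip statistics) would then be fed to the classifier stage; libraries like librosa provide an equivalent, battle-tested implementation.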
Pages: 77086-77098 (13 pages)