Robust Speech Emotion Recognition Using CNN plus LSTM Based on Stochastic Fractal Search Optimization Algorithm

Cited by: 93
Authors
Abdelhamid, Abdelaziz A. [1, 2]
El-Kenawy, El-Sayed M. [3, 4]
Alotaibi, Bandar [5, 6]
Amer, Ghada M. [7]
Abdelkader, Mahmoud Y.
Ibrahim, Abdelhameed [8 ]
Eid, Marwa Metwally
Affiliations
[1] Ain Shams Univ, Fac Comp & Informat Sci, Dept Comp Sci, Cairo 11566, Egypt
[2] Shaqra Univ, Dept Comp Sci, Coll Comp & Informat Technol, Riyadh 11961, Saudi Arabia
[3] Delta Higher Inst Engn & Technol DHIET, Dept Commun & Elect, Mansoura 35111, Egypt
[4] Delta Univ Sci & Technol, Fac Artificial Intelligence, Mansoura 35712, Egypt
[5] Univ Tabuk, Fac Comp & Informat Technol, Dept Informat Technol, Tabuk 71491, Saudi Arabia
[6] Univ Tabuk, Sensors Networks & Cellular Syst Res Ctr, Tabuk 71491, Saudi Arabia
[7] Benha Univ, Dept Elect Engn, Fac Engn, Banha 13511, Egypt
[8] Mansoura Univ, Dept Comp Engn & Control Syst, Fac Engn, Mansoura 35516, Egypt
Keywords
Deep learning; Speech recognition; Convolutional neural networks; Emotion recognition; Feature extraction; Optimization; Training; Speech emotions; deep learning; stochastic fractal search optimization; guided whale optimization algorithm; DEEP LEARNING ARCHITECTURES; NEURAL-NETWORKS; RECURRENT; MODEL; SELECTION;
DOI
10.1109/ACCESS.2022.3172954
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
One of the main challenges facing current approaches to speech emotion recognition is the lack of a dataset large enough to train the currently available deep learning models properly. Therefore, this paper proposes a new data augmentation algorithm to enrich the speech emotions dataset with more samples through a careful addition of noise fractions. In addition, the hyperparameters of the currently available deep learning models are either handcrafted or adjusted during the training process. However, this approach does not guarantee finding the best settings for these parameters. Therefore, we propose an optimized deep learning model in which the hyperparameters are optimized to find their best settings and thus achieve better recognition results. This deep learning model consists of a convolutional neural network (CNN) composed of four local feature-learning blocks and a long short-term memory (LSTM) layer for learning local and long-term correlations in the log Mel-spectrogram of the input speech samples. To improve the performance of this deep network, the learning rate and the label smoothing regularization factor are optimized using the recently emerged stochastic fractal search (SFS)-guided whale optimization algorithm (WOA). The strength of this algorithm is its ability to balance exploration and exploitation of the search agents' positions to guarantee that the globally optimal solution is reached. To prove the effectiveness of the proposed approach, four speech emotion datasets, namely, IEMOCAP, Emo-DB, RAVDESS, and SAVEE, are used in the experiments. Experimental results confirmed the superiority of the proposed approach when compared with state-of-the-art approaches. Based on the four datasets, the achieved recognition accuracies are 98.13%, 99.76%, 99.47%, and 99.50%, respectively.
Moreover, a statistical analysis of the achieved results is provided to emphasize the stability of the proposed approach.
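Two ingredients named in the abstract, noise-based augmentation and label smoothing, can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify how the "noise fractions" are chosen, so the sketch assumes plain white Gaussian noise at a target signal-to-noise ratio, and it uses the standard label smoothing formula. The function names, the 220 Hz test tone, the 20 dB SNR, and the smoothing factor 0.1 are all hypothetical values chosen for illustration.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, rng):
    """Return a noisy copy of `signal` with white Gaussian noise at `snr_db` dB.

    Assumption: generic additive-noise augmentation, not the paper's scheme.
    """
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def smooth_labels(true_class, num_classes, epsilon):
    """Standard label smoothing: (1 - epsilon) * one_hot + epsilon / K."""
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if k == true_class else uniform
        for k in range(num_classes)
    ]


# Hypothetical stand-in for a speech waveform: 1 s of a 220 Hz tone at 16 kHz.
rng = random.Random(0)
clean = [math.sin(2.0 * math.pi * 220.0 * n / 16000.0) for n in range(16000)]
noisy = add_noise_at_snr(clean, snr_db=20.0, rng=rng)

# Measure the SNR actually obtained from the sampled noise.
noise = [b - a for a, b in zip(clean, noisy)]
measured_snr_db = 10.0 * math.log10(
    sum(x * x for x in clean) / sum(x * x for x in noise)
)

# Smoothed training target for class 2 out of 4 hypothetical emotion classes.
targets = smooth_labels(true_class=2, num_classes=4, epsilon=0.1)
```

In the paper's setting, the smoothing factor (here `epsilon`) and the learning rate are the two values tuned by the SFS-guided WOA rather than fixed by hand as in this sketch.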
Pages: 49265-49284
Page count: 20