An ensemble 1D-CNN-LSTM-GRU model with data augmentation for speech emotion recognition

Cited by: 60
Authors
Ahmed, Md. Rayhan [1 ]
Islam, Salekul [1 ]
Islam, A. K. M. Muzahidul [1 ]
Shatabda, Swakkhar [1 ]
Affiliations
[1] United Int Univ, Dept Comp Sci & Engn, Dhaka, Bangladesh
Keywords
Speech emotion recognition; Human-computer interaction; 1D CNN GRU LSTM network; Ensemble learning; Data augmentation; Feature selection; 2D CNN; Features; Classification; Network
DOI
10.1016/j.eswa.2023.119633
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Precise recognition of emotion from speech signals aids in enhancing human-computer interaction (HCI). The performance of a speech emotion recognition (SER) system depends on the features derived from speech signals. However, selecting the optimal set of feature representations remains the most challenging task in SER because the effectiveness of features varies with emotions. Most studies extract hidden local speech features while ignoring the global long-term contextual representations of speech signals. Existing SER systems suffer from low recognition performance mainly due to the scarcity of available data and sub-optimal feature representations. Motivated by the efficient feature extraction of convolutional neural networks (CNN), long short-term memory (LSTM), and gated recurrent units (GRU), this article proposes an ensemble that utilizes the combined predictive performance of three different architectures. The first architecture uses a 1D CNN followed by Fully Connected Networks (FCN). In the other two architectures, LSTM-FCN and GRU-FCN layers follow the CNN layer, respectively. All three individual models focus on extracting both local and long-term global contextual representations of speech signals. The ensemble uses a weighted average of the individual models' predictions. We evaluated the model's performance on five benchmark datasets: TESS, EMO-DB, RAVDESS, SAVEE, and CREMA-D. We augmented the data by injecting additive white Gaussian noise, pitch shifting, and stretching the signal to obtain better model generalization. Five categories of features were extracted from each audio file in those datasets: mel-frequency cepstral coefficients, log mel-scaled spectrogram, zero-crossing rate, chromagram, and root mean square value.
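The augmentation and feature-extraction steps above can be illustrated with a minimal, stdlib-only sketch. It covers only additive white Gaussian noise injection and the two simplest features named in the abstract (zero-crossing rate and root mean square); the paper's MFCC, log mel-spectrogram, and chromagram features would in practice come from an audio library and are not reproduced here. All function names and the target-SNR parameterization are illustrative assumptions, not taken from the paper.

```python
import math
import random

def add_white_gaussian_noise(signal, snr_db=20.0, seed=0):
    """Inject additive white Gaussian noise at a target SNR in dB (illustrative)."""
    rng = random.Random(seed)
    power = sum(s * s for s in signal) / len(signal)       # mean signal power
    noise_power = power / (10 ** (snr_db / 10.0))          # noise power for target SNR
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose sign changes."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def root_mean_square(frame):
    """RMS energy of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

# Toy usage on a synthetic 5-cycle sine wave standing in for an utterance
clean = [math.sin(2 * math.pi * 5 * t / 400) for t in range(400)]
noisy = add_white_gaussian_noise(clean, snr_db=20.0)
print(round(zero_crossing_rate(clean), 4), round(root_mean_square(clean), 4))
```

The RMS of a full-period sine is 1/√2 ≈ 0.7071, which gives a quick sanity check on the feature code.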
All four models perform exceptionally well in the SER task; notably, the ensemble model achieves state-of-the-art (SOTA) weighted average accuracies of 99.46% on TESS, 95.42% on EMO-DB, 95.62% on RAVDESS, 93.22% on SAVEE, and 90.47% on CREMA-D, thereby significantly outperforming previous SOTA models on the same datasets.
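The weighted-average ensembling described in the abstract can be sketched as follows. The three probability vectors and the weights are hypothetical placeholders; the paper does not disclose its weights in this abstract, so treat the values purely as an illustration of the combination rule.

```python
def weighted_average_ensemble(prob_lists, weights):
    """Combine per-model class-probability vectors by a weighted average."""
    assert len(prob_lists) == len(weights)
    total = sum(weights)
    n_classes = len(prob_lists[0])
    return [sum(w * p[c] for p, w in zip(prob_lists, weights)) / total
            for c in range(n_classes)]

# Hypothetical softmax outputs of the three sub-models for one utterance
cnn_fcn  = [0.70, 0.20, 0.10]
lstm_fcn = [0.60, 0.30, 0.10]
gru_fcn  = [0.50, 0.25, 0.25]
weights  = [0.4, 0.3, 0.3]          # illustrative weights, not from the paper

combined = weighted_average_ensemble([cnn_fcn, lstm_fcn, gru_fcn], weights)
predicted = max(range(len(combined)), key=combined.__getitem__)
print(predicted, [round(x, 3) for x in combined])  # → 0 [0.61, 0.245, 0.145]
```

Because each input vector sums to 1 and the weights are normalized by their total, the combined vector is itself a valid probability distribution; the predicted emotion is simply its argmax.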
Pages: 21
Related papers
50 records in total
  • [21] CNN and LSTM based ensemble learning for human emotion recognition using EEG recordings
    Abhishek Iyer
    Srimit Sritik Das
    Reva Teotia
    Shishir Maheshwari
    Rishi Raj Sharma
    Multimedia Tools and Applications, 2023, 82 : 4883 - 4896
  • [22] Ensemble softmax regression model for speech emotion recognition
    Yaxin Sun
    Guihua Wen
    Multimedia Tools and Applications, 2017, 76 : 8305 - 8328
  • [23] Ensemble softmax regression model for speech emotion recognition
    Sun, Yaxin
    Wen, Guihua
    MULTIMEDIA TOOLS AND APPLICATIONS, 2017, 76 (06) : 8305 - 8328
  • [24] Electrodermal Activity for Emotion Recognition Using CNN and Bi-GRU Model
    Zhu, Lili
    Spachos, Petros
    Gregori, Stefano
    ICC 2023-IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS, 2023, : 5533 - 5538
  • [25] SPEECH EMOTION RECOGNITION WITH MULTISCALE AREA ATTENTION AND DATA AUGMENTATION
    Xu, Mingke
    Zhang, Fan
    Cui, Xiaodong
    Zhang, Wei
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6319 - 6323
  • [26] Improving Speech Emotion Recognition With Adversarial Data Augmentation Network
    Yi, Lu
    Mak, Man-Wai
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 33 (01) : 172 - 184
  • [27] Data Augmentation Techniques for Speech Emotion Recognition and Deep Learning
    Antonio Nicolas, Jose
    de Lope, Javier
    Grana, Manuel
    BIO-INSPIRED SYSTEMS AND APPLICATIONS: FROM ROBOTICS TO AMBIENT INTELLIGENCE, PT II, 2022, 13259 : 279 - 288
  • [28] Silent Speech Recognition: Automatic Lip Reading Model Using 3D CNN and GRU
    Devi, T. Mallika
    Keerthana, Siripurapu
    Santhi, Pentyala
    Pravallika, Puram
    Rajeshwari, Sama
    PROCEEDINGS OF THE 5TH INTERNATIONAL CONFERENCE ON DATA SCIENCE, MACHINE LEARNING AND APPLICATIONS, VOL 1, ICDSMLA 2023, 2025, 1273 : 827 - 832
  • [29] A Data Augmentation Approach for Improving the Performance of Speech Emotion Recognition
    Paraskevopoulou, Georgia
    Spyrou, Evaggelos
    Perantonis, Stavros
    SIGMAP: PROCEEDINGS OF THE 19TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING AND MULTIMEDIA APPLICATIONS, 2022, : 61 - 69
  • [30] Speech Emotion Recognition Based on Speech Segment Using LSTM with Attention Model
    Atmaja, Bagus Tris
    Akagi, Masato
    2019 IEEE INTERNATIONAL CONFERENCE ON SIGNALS AND SYSTEMS (ICSIGSYS), 2019, : 40 - 44