An ensemble 1D-CNN-LSTM-GRU model with data augmentation for speech emotion recognition

Cited by: 60
Authors
Ahmed, Md. Rayhan [1 ]
Islam, Salekul [1 ]
Islam, A. K. M. Muzahidul [1 ]
Shatabda, Swakkhar [1 ]
Affiliation
[1] United International University, Department of Computer Science and Engineering, Dhaka, Bangladesh
Keywords
Speech emotion recognition; Human-computer interaction; 1D CNN GRU LSTM network; Ensemble learning; Data augmentation; Feature selection; 2D CNN; Features; Classification; Network
DOI
10.1016/j.eswa.2023.119633
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Precise recognition of emotion from speech signals aids in enhancing human-computer interaction (HCI). The performance of a speech emotion recognition (SER) system depends on the features derived from speech signals. However, selecting the optimal set of feature representations remains the most challenging task in SER because the effectiveness of features varies with emotions. Most studies extract hidden local speech features while ignoring the global long-term contextual representations of speech signals. Existing SER systems suffer from low recognition performance, mainly due to the scarcity of available data and sub-optimal feature representations. Motivated by the efficient feature extraction of the convolutional neural network (CNN), long short-term memory (LSTM), and gated recurrent unit (GRU), this article proposes an ensemble that utilizes the combined predictive performance of three different architectures. The first architecture uses a 1D CNN followed by fully connected networks (FCN). In the other two architectures, LSTM-FCN and GRU-FCN layers follow the CNN layer, respectively. All three individual models focus on extracting both local and long-term global contextual representations of speech signals. The ensemble uses a weighted average of the individual models' predictions. We evaluated the model's performance on five benchmark datasets: TESS, EMO-DB, RAVDESS, SAVEE, and CREMA-D. To obtain better model generalization, we augmented the data by injecting additive white Gaussian noise, pitch shifting, and stretching the signal. Five categories of features were extracted from each audio file in those datasets: mel-frequency cepstral coefficients (MFCC), log mel-scaled spectrogram, zero-crossing rate, chromagram, and root mean square value. All four models perform exceptionally well in the SER task; notably, the ensemble model achieves state-of-the-art (SOTA) weighted average accuracies of 99.46% on TESS, 95.42% on EMO-DB, 95.62% on RAVDESS, 93.22% on SAVEE, and 90.47% on CREMA-D, significantly outperforming prior SOTA models on the same datasets.
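As a rough sketch of the branch design described in the abstract, the following Keras code builds a minimal CNN-LSTM-FCN branch. The layer depths, filter counts, and unit sizes are illustrative assumptions, not the paper's reported configuration; replacing layers.LSTM with layers.GRU gives the GRU-FCN branch, and removing the recurrent layer leaves the plain 1D CNN-FCN branch.

```python
# Hedged sketch of one ensemble branch; hyperparameters are assumptions.
from tensorflow.keras import layers, models

def build_cnn_lstm_fcn(input_len, n_classes):
    inp = layers.Input(shape=(input_len, 1))                 # 1D feature sequence
    x = layers.Conv1D(64, kernel_size=5, activation="relu",
                      padding="same")(inp)                   # local speech patterns
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.LSTM(128)(x)                                  # long-term global context
    x = layers.Dense(64, activation="relu")(x)               # fully connected network (FCN)
    out = layers.Dense(n_classes, activation="softmax")(x)   # per-emotion probabilities
    return models.Model(inp, out)
```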
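The weighted-average ensemble can be sketched as follows; the weight values and the function name weighted_ensemble are placeholders, since the abstract does not state the weights the authors used.

```python
# Hedged sketch of a weighted-average ensemble over the three base models
# (1D CNN-FCN, CNN-LSTM-FCN, CNN-GRU-FCN); equal weights are an assumption.
import numpy as np

def weighted_ensemble(probs_cnn, probs_lstm, probs_gru,
                      weights=(1/3, 1/3, 1/3)):
    """Combine per-class probability matrices of shape (n_samples, n_classes)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # normalize the weights
    combined = w[0] * probs_cnn + w[1] * probs_lstm + w[2] * probs_gru
    return np.argmax(combined, axis=1)                # predicted class indices
```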
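A minimal sketch of the three augmentations named in the abstract, using librosa and NumPy; the noise factor, semitone shift, stretch rate, and the file name speech_sample.wav are assumed values for illustration.

```python
# Hedged sketch of the three augmentations; parameter values are assumptions,
# not the paper's reported settings.
import numpy as np
import librosa

def add_white_noise(y, noise_factor=0.005):
    """Inject additive white Gaussian noise scaled to the signal amplitude."""
    noise = np.random.randn(len(y))
    return y + noise_factor * np.max(np.abs(y)) * noise

def shift_pitch(y, sr, n_steps=2):
    """Shift pitch by n_steps semitones without changing duration."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

def stretch(y, rate=0.9):
    """Slow down (rate < 1) or speed up (rate > 1) the signal."""
    return librosa.effects.time_stretch(y, rate=rate)

y, sr = librosa.load("speech_sample.wav", sr=None)   # hypothetical input file
augmented = [add_white_noise(y), shift_pitch(y, sr), stretch(y)]
```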
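The five feature categories can be extracted per file with librosa along the following lines; pooling each frame-level feature to its time average, and n_mfcc=40, are assumptions, as the abstract does not specify how features are aggregated.

```python
# Hedged sketch of the five feature categories listed in the abstract;
# time-averaging and n_mfcc=40 are illustrative assumptions.
import numpy as np
import librosa

def extract_features(y, sr):
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    log_mel = np.mean(librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr)), axis=1)
    zcr = np.mean(librosa.feature.zero_crossing_rate(y), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)
    rms = np.mean(librosa.feature.rms(y=y), axis=1)
    return np.hstack([mfcc, log_mel, zcr, chroma, rms])  # one 1D vector per file
```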
Pages: 21