An ensemble 1D-CNN-LSTM-GRU model with data augmentation for speech emotion recognition

Cited by: 60
Authors
Ahmed, Md. Rayhan [1 ]
Islam, Salekul [1 ]
Islam, A. K. M. Muzahidul [1 ]
Shatabda, Swakkhar [1 ]
Affiliations
[1] United Int Univ, Dept Comp Sci & Engn, Dhaka, Bangladesh
Keywords
Speech emotion recognition; Human-computer interaction; 1D CNN GRU LSTM network; Ensemble learning; Data augmentation; Feature selection; 2D CNN; Features; Classification; Network
DOI
10.1016/j.eswa.2023.119633
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Precise recognition of emotion from speech signals aids in enhancing human-computer interaction (HCI). The performance of a speech emotion recognition (SER) system depends on the features derived from speech signals. However, selecting the optimal set of feature representations remains the most challenging task in SER because the effectiveness of features varies with emotions. Most studies extract hidden local speech features while ignoring the global long-term contextual representations of speech signals. Existing SER systems suffer from low recognition performance mainly due to the scarcity of available data and sub-optimal feature representations. Motivated by the efficient feature extraction of convolutional neural networks (CNN), long short-term memory (LSTM), and gated recurrent units (GRU), this article proposes an ensemble utilizing the combined predictive performance of three different architectures. The first architecture uses a 1D CNN followed by fully connected networks (FCN). In the other two architectures, LSTM-FCN and GRU-FCN layers follow the CNN layers, respectively. All three individual models focus on extracting both local and long-term global contextual representations of speech signals. The ensemble uses a weighted average of the individual models' predictions. We evaluated the models' performance on five benchmark datasets: TESS, EMO-DB, RAVDESS, SAVEE, and CREMA-D. To obtain better model generalization, we augmented the data by injecting additive white Gaussian noise, pitch shifting, and stretching the signal. Five categories of features were extracted from each audio file in those datasets: mel-frequency cepstral coefficients, log mel-scaled spectrogram, zero-crossing rate, chromagram, and root mean square value.
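The augmentation steps named above can be sketched in plain NumPy. The noise factor and stretch rate here are illustrative assumptions (the abstract does not give the paper's actual parameters), and pitch shifting, which in practice is usually done with a library routine such as `librosa.effects.pitch_shift`, is omitted for brevity:

```python
import numpy as np

def add_awgn(signal, noise_factor=0.005):
    """Inject additive white Gaussian noise scaled by noise_factor (assumed value)."""
    noise = np.random.randn(len(signal))
    return signal + noise_factor * noise

def stretch(signal, rate=1.1):
    """Naive time stretch via linear interpolation; rate > 1 shortens the signal."""
    n_out = int(len(signal) / rate)
    old_idx = np.arange(len(signal))
    new_idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_idx, old_idx, signal)

# Example: augment a 1-second dummy 440 Hz tone sampled at 16 kHz.
sr = 16000
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)
y_noisy = add_awgn(y)
y_stretched = stretch(y, rate=1.1)
print(len(y), len(y_noisy), len(y_stretched))
```

Each augmented copy is then passed through the same feature-extraction pipeline as the original clip, effectively multiplying the training data.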
All four models perform exceptionally well on the SER task; notably, the ensemble model achieves state-of-the-art (SOTA) weighted average accuracies of 99.46% on TESS, 95.42% on EMO-DB, 95.62% on RAVDESS, 93.22% on SAVEE, and 90.47% on CREMA-D, thus significantly outperforming previous SOTA models on the same datasets.
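The ensemble described in the abstract combines the three architectures by a weighted average of their class predictions. A minimal sketch of that combination step, with made-up softmax outputs and equal weights (the paper's actual weights are not stated in this record):

```python
import numpy as np

def weighted_average_ensemble(prob_list, weights):
    """Combine per-model class-probability matrices by a weighted average.

    prob_list: list of (n_samples, n_classes) softmax outputs, one per model.
    weights:   one scalar per model; normalized here so they sum to 1.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    stacked = np.stack(prob_list)           # (n_models, n_samples, n_classes)
    avg = np.tensordot(w, stacked, axes=1)  # (n_samples, n_classes)
    return avg.argmax(axis=1), avg

# Hypothetical outputs of the three base models (CNN-FCN, CNN-LSTM-FCN, CNN-GRU-FCN)
# for two samples over three emotion classes.
p_cnn  = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.5, 0.3]])
p_lstm = np.array([[0.5, 0.4, 0.1],
                   [0.1, 0.7, 0.2]])
p_gru  = np.array([[0.7, 0.2, 0.1],
                   [0.3, 0.4, 0.3]])

labels, probs = weighted_average_ensemble([p_cnn, p_lstm, p_gru], weights=[1, 1, 1])
print(labels)  # predicted emotion class index per sample
```

Because the weights are normalized, the averaged rows remain valid probability distributions, and the argmax gives the ensemble's predicted emotion label.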
Pages: 21