Speech Emotion Recognition via Sparse Learning-Based Fusion Model

Cited by: 0
Authors
Min, Dong-Jin [1 ]
Kim, Deok-Hwan [1 ]
Affiliation
[1] Inha Univ, Dept Elect & Comp Engn, Incheon 22212, South Korea
Source
IEEE ACCESS | 2024, Vol. 12
Funding
National Research Foundation of Singapore
Keywords
Emotion recognition; Speech recognition; Hidden Markov models; Feature extraction; Brain modeling; Accuracy; Convolutional neural networks; Data models; Time-domain analysis; Deep learning; 2D convolutional neural network squeeze and excitation network; multivariate long short-term memory-fully convolutional network; late fusion; sparse learning; FEATURES; DATABASES; ATTENTION; NETWORK;
DOI
10.1109/ACCESS.2024.3506565
CLC number
TP [Automation Technology, Computer Technology]
Subject classification code
0812
Abstract
Speech communication is a powerful tool for conveying intentions and emotions, fostering mutual understanding, and strengthening relationships, and speech emotion recognition plays a crucial role in natural human-computer interaction. The process involves three stages: dataset collection, feature extraction, and emotion classification. Collecting speech emotion datasets is complex and costly, so available corpora tend to be small and emotionally imbalanced; this scarcity and imbalance degrade the accuracy and reliability of emotion recognition. To address these issues, this study introduces a more robust and adaptive model built on the Ranking Magnitude Method (RMM), which is based on sparse learning. We use Root Mean Square (RMS) energy and Zero Crossing Rate (ZCR) as temporal features that measure the speech's overall volume and noise intensity, and Mel Frequency Cepstral Coefficients (MFCCs) to capture critical spectral characteristics; these features are integrated into a multivariate Long Short-Term Memory-Fully Convolutional Network (LSTM-FCN). For spatial features, we analyze utterances via the log-Mel spectrogram and process these patterns with a 2D Convolutional Neural Network Squeeze and Excitation Network (CNN-SEN). The core of our method is the Sparse Learning-Based Fusion Model (SLBF), which addresses dataset imbalance by selectively retraining underperforming nodes. This dynamic adjustment of learning priorities significantly enhances the robustness and accuracy of emotion recognition. With this approach, our model outperforms state-of-the-art methods across multiple datasets, achieving accuracy rates of 97.18%, 97.92%, 99.31%, and 96.89% on EMOVO, RAVDESS, SAVEE, and EMO-DB, respectively.
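The abstract outlines a two-branch feature pipeline: temporal descriptors (RMS energy, ZCR) plus MFCCs feed the multivariate LSTM-FCN branch, while the log-Mel spectrogram feeds the 2D CNN-SEN branch. As a rough illustration only, the sketch below extracts these features with librosa; the sampling rate, frame size, hop length, and coefficient counts are assumptions for the example, not the settings reported in the paper.

# Sketch of the feature-extraction stage: temporal features (RMS energy,
# zero-crossing rate), MFCCs for the LSTM-FCN branch, and a log-Mel
# spectrogram for the 2D CNN-SEN branch. Frame/hop sizes and coefficient
# counts are illustrative guesses, not the paper's exact configuration.
import numpy as np
import librosa

def extract_features(path, sr=16000, n_fft=1024, hop=256, n_mfcc=40, n_mels=64):
    y, sr = librosa.load(path, sr=sr)

    # Temporal descriptors: overall volume (RMS energy) and noisiness (ZCR).
    rms = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)               # (1, T)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop)  # (1, T)

    # Spectral features: MFCCs feed the multivariate LSTM-FCN branch.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)                          # (n_mfcc, T)

    # Log-Mel spectrogram feeds the 2D CNN-SEN branch as an image-like input.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)                                   # (n_mels, T)

    # Stack per-frame temporal + MFCC features into one multivariate sequence.
    sequence = np.concatenate([rms, zcr, mfcc], axis=0).T                             # (T, n_mfcc + 2)
    return sequence, log_mel

The returned sequence is a (time, features) array suited to a recurrent branch, while log_mel is an image-like array suited to a 2D convolutional branch.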
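On the model side, the abstract names a Squeeze and Excitation mechanism inside the CNN branch and a sparse-learning step (the RMM) that selectively retrains underperforming nodes. The PyTorch sketch below shows a standard SE block and one plausible magnitude-ranking routine standing in for the RMM; the layer sizes, the ranking criterion, and the retraining fraction are all illustrative assumptions, not the authors' exact design.

# Sketch of two pieces the abstract names: a squeeze-and-excitation (SE)
# block for the 2D CNN branch, and a magnitude-based ranking step standing
# in for the Ranking Magnitude Method (RMM). All sizes and the ranking
# criterion are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight feature-map channels by global context."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                            # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                       # squeeze: global average pooling
        w = self.fc(w).view(x.size(0), -1, 1, 1)     # excitation: per-channel weights
        return x * w                                 # rescale channels

def rank_low_magnitude_params(model, fraction=0.2):
    """Rank weight tensors by mean absolute magnitude and keep only the
    weakest `fraction` trainable, freezing the rest, so a later fine-tuning
    pass concentrates on underperforming parts of the network (one plausible
    reading of selective retraining; the paper's RMM may differ)."""
    named = [(n, p) for n, p in model.named_parameters() if p.dim() > 1]
    named.sort(key=lambda np_: np_[1].abs().mean().item())
    cutoff = max(1, int(len(named) * fraction))
    for i, (_, p) in enumerate(named):
        p.requires_grad = i < cutoff                 # retrain only the weakest tensors
    return [n for n, _ in named[:cutoff]]

In practice one would train the full model first, call rank_low_magnitude_params, and then run a short fine-tuning pass so that gradient updates concentrate on the low-magnitude tensors.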
Pages: 177219 - 177235
Page count: 17
Related Papers
50 records in total
  • [21] Anchor Model Fusion for Emotion Recognition in Speech
    Ortego-Resa, Carlos
    Lopez-Moreno, Ignacio
    Ramos, Daniel
    Gonzalez-Rodriguez, Joaquin
    [J]. BIOMETRIC ID MANAGEMENT AND MULTIMODAL COMMUNICATION, PROCEEDINGS, 2009, 5707 : 49 - 56
  • [22] Emotion Recognition in EEG Based on Multilevel Multidomain Feature Fusion
    Li, Zhao Long
    Cao, Hui
    Zhang, Ji Sai
    [J]. IEEE ACCESS, 2024, 12 : 87237 - 87247
  • [23] EmoNet: A Transfer Learning Framework for Multi-Corpus Speech Emotion Recognition
    Gerczuk, Maurice
    Amiriparian, Shahin
    Ottl, Sandra
Schuller, Bjorn W.
    [J]. IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2023, 14 (02) : 1472 - 1487
  • [24] Learning Salient Segments for Speech Emotion Recognition Using Attentive Temporal Pooling
    Xia, Xiaohan
    Jiang, Dongmei
    Sahli, Hichem
[J]. IEEE ACCESS, 2020, 8 : 151740 - 151752
  • [25] Speech emotion recognition via learning analogies
    Ntalampiras, Stavros
    [J]. PATTERN RECOGNITION LETTERS, 2021, 144 : 21 - 26
  • [26] Pattern recognition and features selection for speech emotion recognition model using deep learning
    Jermsittiparsert, Kittisak
    Abdurrahman, Abdurrahman
    Siriattakul, Parinya
    Sundeeva, Ludmila A.
    Hashim, Wahidah
    Rahim, Robbi
    Maseleno, Andino
    [J]. INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2020, 23 (04) : 799 - 806
  • [28] Robust emotion recognition in noisy speech via sparse representation
    Zhao, Xiaoming
    Zhang, Shiqing
    Lei, Bicheng
    [J]. NEURAL COMPUTING & APPLICATIONS, 2014, 24 (7-8) : 1539 - 1553
  • [30] Deep learning based Affective Model for Speech Emotion Recognition
    Zhou, Xi
    Guo, Junqi
    Bie, Rongfang
    [J]. 2016 INT IEEE CONFERENCES ON UBIQUITOUS INTELLIGENCE & COMPUTING, ADVANCED & TRUSTED COMPUTING, SCALABLE COMPUTING AND COMMUNICATIONS, CLOUD AND BIG DATA COMPUTING, INTERNET OF PEOPLE, AND SMART WORLD CONGRESS (UIC/ATC/SCALCOM/CBDCOM/IOP/SMARTWORLD), 2016, : 841 - 846