Speech Emotion Recognition via Sparse Learning-Based Fusion Model

Authors
Min, Dong-Jin [1]
Kim, Deok-Hwan [1]
Affiliations
[1] Inha Univ, Dept Elect & Comp Engn, Incheon 22212, South Korea
Source
IEEE ACCESS | 2024 / Vol. 12
Funding
National Research Foundation, Singapore;
Keywords
Emotion recognition; Speech recognition; Hidden Markov models; Feature extraction; Brain modeling; Accuracy; Convolutional neural networks; Data models; Time-domain analysis; Deep learning; 2D convolutional neural network squeeze and excitation network; multivariate long short-term memory-fully convolutional network; late fusion; sparse learning; FEATURES; DATABASES; ATTENTION; NETWORK;
DOI
10.1109/ACCESS.2024.3506565
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Speech communication is a powerful tool for conveying intentions and emotions, fostering mutual understanding, and strengthening relationships. In natural human-computer interaction, speech emotion recognition plays a crucial role. The process involves three stages: dataset collection, feature extraction, and emotion classification. Collecting speech emotion recognition datasets is complex and costly, which leads to limited data volumes and uneven emotion distributions. This scarcity and imbalance reduce the accuracy and reliability of emotion recognition. To address these issues, this study introduces a more robust and adaptive model that employs the Ranking Magnitude Method (RMM), which is based on sparse learning. We use the Root Mean Square (RMS) energy and Zero Crossing Rate (ZCR) as temporal features to measure the overall volume and noise intensity of the speech. Mel Frequency Cepstral Coefficients (MFCCs) are used to extract critical speech features, and these features are fed into a multivariate Long Short-Term Memory-Fully Convolutional Network (LSTM-FCN) model. For spatial features, we analyze each utterance using the log-Mel spectrogram and process the resulting patterns with a 2D Convolutional Neural Network Squeeze and Excitation Network (CNN-SEN) model. The core of our method is a Sparse Learning-Based Fusion Model (SLBF), which addresses dataset imbalance by selectively retraining the underperforming nodes. This dynamic adjustment of learning priorities enhances the robustness and accuracy of emotion recognition. With this approach, our model outperforms state-of-the-art methods on several datasets, achieving accuracies of 97.18%, 97.92%, 99.31%, and 96.89% on the EMOVO, RAVDESS, SAVEE, and EMO-DB datasets, respectively.
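The two temporal features named in the abstract, RMS energy and ZCR, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the frame length (2048), hop length (512), 16 kHz sampling rate, the synthetic 440 Hz test tone, and the helper names frame_signal, rms_energy, and zero_crossing_rate are all assumptions chosen for illustration.

```python
import numpy as np

def frame_signal(x, frame_length=2048, hop_length=512):
    """Split a 1-D signal into overlapping frames (no padding).

    frame_length and hop_length are illustrative defaults, not values
    taken from the paper.
    """
    n_frames = 1 + (len(x) - frame_length) // hop_length
    idx = (np.arange(frame_length)[None, :]
           + hop_length * np.arange(n_frames)[:, None])
    return x[idx]

def rms_energy(frames):
    """Root mean square energy of each frame (a proxy for loudness)."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose sign differs, per frame."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

# Synthetic stand-in for a speech recording: one second of a 440 Hz tone.
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)

frames = frame_signal(tone)
rms = rms_energy(frames)          # one RMS value per frame
zcr = zero_crossing_rate(frames)  # one ZCR value per frame
print(frames.shape, rms.mean(), zcr.mean())
```

For a pure tone of amplitude 0.5 the per-frame RMS sits close to 0.5 / sqrt(2), and the ZCR sits close to 2 * 440 / sr, since a sinusoid crosses zero twice per period. On real speech both values vary from frame to frame, which is what makes them usable as a time series for a sequence model.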
Pages: 177219-177235
Page count: 17