A Combined CNN Architecture for Speech Emotion Recognition

Cited by: 1
Authors
Begazo, Rolinson [1 ]
Aguilera, Ana [2 ,3 ]
Dongo, Irvin [1 ,4 ]
Cardinale, Yudith [5 ]
Affiliations
[1] Univ Catolica San Pablo, Elect & Elect Engn Dept, Arequipa 04001, Peru
[2] Univ Valparaiso, Fac Ingn, Escuela Ingn Informat, Valparaiso 2340000, Chile
[3] Univ Valparaiso, Interdisciplinary Ctr Biomed Res & Hlth Engn MEDIN, Valparaiso 2340000, Chile
[4] Univ Bordeaux, ESTIA Inst Technol, F-64210 Bidart, France
[5] Univ Int Valencia, Grp Invest Ciencia Datos, Valencia 46002, Spain
Keywords
speech emotion recognition; deep learning; spectral features; spectrogram imaging; feature fusion; convolutional neural network; NEURAL-NETWORKS; FEATURES; CORPUS;
DOI
10.3390/s24175797
CLC Number
O65 [Analytical Chemistry];
Discipline Codes
070302; 081704;
Abstract
Emotion recognition through speech is a technique employed in various scenarios of Human-Computer Interaction (HCI). Existing approaches have achieved significant results; however, limitations persist, most notably in the quantity and diversity of available data when deep learning techniques are used. The lack of a standard for feature selection leads to continuous development and experimentation, and choosing and designing an appropriate network architecture constitutes another challenge. This study addresses the challenge of recognizing emotions in the human voice using deep learning techniques, proposing a comprehensive approach that includes preprocessing and feature selection stages and constructing a dataset, called EmoDSc, by combining several available databases. The synergy between spectral features and spectrogram images is investigated. Independently, the weighted accuracy obtained using only spectral features was 89%, while using only spectrogram images it reached 90%. These results, although surpassing previous research, highlight the strengths and limitations of each modality when operating in isolation. Based on this exploration, a neural network architecture composed of a CNN1D, a CNN2D, and an MLP that fuses spectral features and spectrogram images is proposed. The model, supported by the unified dataset EmoDSc, achieves a remarkable accuracy of 96%.
Pages: 39
Related Papers (50 total)
  • [31] CyTex: Transforming speech to textured images for speech emotion recognition
    Bakhshi, Ali
    Harimi, Ali
    Chalup, Stephan
    SPEECH COMMUNICATION, 2022, 139: 62-75
  • [32] Speech Emotion Recognition Based on Attention MCNN Combined With Gender Information
    Hu, Zhangfang
    LingHu, Kehuan
    Yu, Hongling
    Liao, Chenzhuo
    IEEE ACCESS, 2023, 11: 50285-50294
  • [33] Hybrid Time Distributed CNN-transformer for Speech Emotion Recognition
    Slimi, Anwer
    Nicolas, Henri
    Zrigui, Mounir
    PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON SOFTWARE TECHNOLOGIES (ICSOFT), 2022: 602-611
  • [34] Gender-Aware CNN-BLSTM for Speech Emotion Recognition
    Zhang, Linjuan
    Wang, Longbiao
    Dang, Jianwu
    Guo, Lili
    Yu, Qiang
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2018, PT I, 2018, 11139: 782-790
  • [35] A Comprehensive Review of Speech Emotion Recognition Systems
    Wani, Taiba Majid
    Gunawan, Teddy Surya
    Qadri, Syed Asif Ahmad
    Kartiwi, Mira
    Ambikairajah, Eliathamby
    IEEE ACCESS, 2021, 9: 47795-47814
  • [36] Cascaded Convolutional Neural Network Architecture for Speech Emotion Recognition in Noisy Conditions
    Nam, Youngja
    Lee, Chankyu
    SENSORS, 2021, 21 (13)
  • [37] Speech Emotion Recognition Based on Feature Fusion
    Shen, Qi
    Chen, Guanggen
    Chang, Lin
    PROCEEDINGS OF THE 2017 2ND INTERNATIONAL CONFERENCE ON MATERIALS SCIENCE, MACHINERY AND ENERGY ENGINEERING (MSMEE 2017), 2017, 123: 1071-1074
  • [38] A statistical feature extraction for deep speech emotion recognition in a bilingual scenario
    Sekkate, Sara
    Khalil, Mohammed
    Adib, Abdellah
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (08): 11443-11460
  • [39] A Review on Speech Emotion Recognition Using Deep Learning and Attention Mechanism
    Lieskovska, Eva
    Jakubec, Maros
    Jarina, Roman
    Chmulik, Michal
    ELECTRONICS, 2021, 10 (10)
  • [40] Speech emotion recognition using the novel PEmoNet (Parallel Emotion Network)
    Bhangale, Kishor B.
    Kothandaraman, Mohanaprasad
    APPLIED ACOUSTICS, 2023, 212