A Combined CNN Architecture for Speech Emotion Recognition

Cited by: 1

Authors:

Begazo, Rolinson [1]

Aguilera, Ana [2,3]

Dongo, Irvin [1,4]

Cardinale, Yudith [5]

Affiliations:

[1] Univ Catolica San Pablo, Elect & Elect Engn Dept, Arequipa 04001, Peru

[2] Univ Valparaiso, Fac Ingn, Escuela Ingn Informat, Valparaiso 2340000, Chile

[3] Univ Valparaiso, Interdisciplinary Ctr Biomed Res & Hlth Engn MEDIN, Valparaiso 2340000, Chile

[4] Univ Bordeaux, ESTIA Inst Technol, F-64210 Bidart, France

[5] Univ Int Valencia, Grp Invest Ciencia Datos, Valencia 46002, Spain

Source:

SENSORS | 2024 / Vol. 24 / Issue 17

Keywords:

speech emotion recognition; deep learning; spectral features; spectrogram imaging; feature fusion; convolutional neural network; NEURAL-NETWORKS; FEATURES; CORPUS;

DOI:

10.3390/s24175797

Chinese Library Classification (CLC) number:

O65 [Analytical Chemistry];

Discipline classification codes:

070302 ; 081704 ;

Abstract:

Emotion recognition through speech is a technique employed in various scenarios of Human-Computer Interaction (HCI). Existing approaches have achieved significant results; however, limitations persist, and the limited quantity and diversity of data become more notable when deep learning techniques are used. The lack of a standard in feature selection leads to continuous development and experimentation. Choosing and designing the appropriate network architecture constitutes another challenge. This study addresses the challenge of recognizing emotions in the human voice using deep learning techniques. It proposes a comprehensive approach, develops preprocessing and feature selection stages, and constructs a dataset called EmoDSc by combining several available databases. The synergy between spectral features and spectrogram images is investigated. Independently, the weighted accuracy obtained using only spectral features was 89%, while using only spectrogram images, the weighted accuracy reached 90%. These results, although surpassing previous research, highlight the strengths and limitations of each input type when operating in isolation. Based on this exploration, a neural network architecture composed of a CNN1D, a CNN2D, and an MLP that fuses spectral features and spectrogram images is proposed. The model, supported by the unified dataset EmoDSc, achieves an accuracy of 96%.
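The fusion idea described in the abstract (a 1-D convolutional branch for spectral features, a 2-D convolutional branch for spectrogram images, and an MLP on the concatenated outputs) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the layer counts, filter sizes, global average pooling, random weights, input shapes, and the number of emotion classes are all assumptions chosen only to keep the example small, and the function names (`fused_forward`, `init_params`) are hypothetical.

```python
import math
import random


def relu(x):
    return x if x > 0.0 else 0.0


def conv1d(signal, kernel):
    """Valid 1-D cross-correlation followed by ReLU."""
    k = len(kernel)
    return [relu(sum(signal[i + j] * kernel[j] for j in range(k)))
            for i in range(len(signal) - k + 1)]


def conv2d(image, kernel):
    """Valid 2-D cross-correlation followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            acc = sum(image[r + i][c + j] * kernel[i][j]
                      for i in range(kh) for j in range(kw))
            row.append(relu(acc))
        out.append(row)
    return out


def dense(vec, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(rng, n1=4, n2=4, k=3, hidden=8, n_classes=4):
    """Random weights; every size here is an arbitrary assumption."""
    u = lambda: rng.uniform(-0.5, 0.5)
    return {
        "k1d": [[u() for _ in range(k)] for _ in range(n1)],
        "k2d": [[[u() for _ in range(k)] for _ in range(k)] for _ in range(n2)],
        "w_h": [[u() for _ in range(n1 + n2)] for _ in range(hidden)],
        "b_h": [0.0] * hidden,
        "w_o": [[u() for _ in range(hidden)] for _ in range(n_classes)],
        "b_o": [0.0] * n_classes,
    }


def fused_forward(spectral, spectrogram, params):
    """Run both branches, concatenate their outputs, classify with an MLP."""
    # Branch 1 (CNN1D stand-in): one globally averaged value per 1-D filter.
    feat_1d = []
    for kernel in params["k1d"]:
        fmap = conv1d(spectral, kernel)
        feat_1d.append(sum(fmap) / len(fmap))
    # Branch 2 (CNN2D stand-in): one globally averaged value per 2-D filter.
    feat_2d = []
    for kernel in params["k2d"]:
        flat = [v for row in conv2d(spectrogram, kernel) for v in row]
        feat_2d.append(sum(flat) / len(flat))
    # Fusion by concatenation, then a one-hidden-layer MLP with softmax output.
    fused = feat_1d + feat_2d
    hidden = [relu(h) for h in dense(fused, params["w_h"], params["b_h"])]
    return softmax(dense(hidden, params["w_o"], params["b_o"]))


rng = random.Random(0)
params = init_params(rng)
spectral = [rng.random() for _ in range(20)]                        # toy feature vector
spectrogram = [[rng.random() for _ in range(8)] for _ in range(8)]  # toy 8x8 "image"
probs = fused_forward(spectral, spectrogram, params)
print(len(probs), round(sum(probs), 6))
```

Concatenation is only one possible fusion strategy; it is used here because it lets each branch be evaluated in isolation first, mirroring the single-input experiments the abstract reports before the combined model.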


Pages: 39
