A Combined CNN Architecture for Speech Emotion Recognition

Cited by: 1

Authors:

Begazo, Rolinson ^{[1]}

Aguilera, Ana ^{[2,3]}

Dongo, Irvin ^{[1,4]}

Cardinale, Yudith ^{[5]}

Affiliations:

[1] Univ Catolica San Pablo, Elect & Elect Engn Dept, Arequipa 04001, Peru

[2] Univ Valparaiso, Fac Ingn, Escuela Ingn Informat, Valparaiso 2340000, Chile

[3] Univ Valparaiso, Interdisciplinary Ctr Biomed Res & Hlth Engn MEDIN, Valparaiso 2340000, Chile

[4] Univ Bordeaux, ESTIA Inst Technol, F-64210 Bidart, France

[5] Univ Int Valencia, Grp Invest Ciencia Datos, Valencia 46002, Spain

Source:

SENSORS | 2024 / Vol. 24 / Issue 17

Keywords:

speech emotion recognition; deep learning; spectral features; spectrogram imaging; feature fusion; convolutional neural network; NEURAL-NETWORKS; FEATURES; CORPUS;

DOI:

10.3390/s24175797

Chinese Library Classification (CLC) number:

O65 [Analytical Chemistry];

Discipline classification codes:

070302 ; 081704 ;

Abstract:

Emotion recognition through speech is a technique employed in various scenarios of Human-Computer Interaction (HCI). Existing approaches have achieved significant results; however, limitations persist, and the quantity and diversity of available data become especially limiting when deep learning techniques are used. The lack of a standard for feature selection leads to continuous development and experimentation, and choosing and designing an appropriate network architecture constitutes another challenge. This study addresses the challenge of recognizing emotions in the human voice using deep learning techniques. It proposes a comprehensive approach, develops preprocessing and feature selection stages, and constructs a dataset called EmoDSc by combining several available databases. The synergy between spectral features and spectrogram images is investigated. Independently, the weighted accuracy obtained using only spectral features was 89%, while using only spectrogram images it reached 90%. These results, although surpassing previous research, highlight the strengths and limitations of each input type when used in isolation. Based on this exploration, a neural network architecture composed of a CNN1D, a CNN2D, and an MLP that fuses spectral features and spectrogram images is proposed. The model, supported by the unified dataset EmoDSc, achieves an accuracy of 96%.
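The abstract reports weighted accuracy without defining it. The sketch below assumes the convention common in speech emotion recognition: weighted accuracy is per-class recall weighted by class support (equivalent to overall accuracy), in contrast to unweighted accuracy, which is the plain mean of per-class recalls. The function names and toy labels are illustrative assumptions, not taken from the paper or from EmoDSc.

```python
from collections import Counter


def _per_class_counts(y_true, y_pred):
    """Return (support, hits): samples per true class and correct predictions per class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return support, hits


def weighted_accuracy(y_true, y_pred):
    """Per-class recall weighted by class support (assumed definition)."""
    support, hits = _per_class_counts(y_true, y_pred)
    total = len(y_true)
    return sum((support[c] / total) * (hits[c] / support[c]) for c in support)


def unweighted_accuracy(y_true, y_pred):
    """Plain mean of per-class recalls; every class counts equally."""
    support, hits = _per_class_counts(y_true, y_pred)
    return sum(hits[c] / support[c] for c in support) / len(support)


# Toy, imbalanced labels (hypothetical, for illustration only).
y_true = ["happy"] * 6 + ["sad"] * 2 + ["angry"] * 2
y_pred = ["happy"] * 6 + ["sad", "happy"] + ["angry", "sad"]

print(f"weighted accuracy:   {weighted_accuracy(y_true, y_pred):.3f}")
print(f"unweighted accuracy: {unweighted_accuracy(y_true, y_pred):.3f}")
```

On imbalanced data the two metrics diverge, because the majority class dominates the weighted figure. This matters when reading results from a dataset merged from several source corpora.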


Pages: 39
