Speech Emotion Recognition Using Deep Neural Networks, Transfer Learning, and Ensemble Classification Techniques

Cited by: 12
Authors
Mihalache, Serban [1 ,2 ]
Burileanu, Dragos [1 ]
Affiliations
[1] Univ Politehn Bucuresti, Speech & Dialogue Res Lab, Bucharest, Romania
[2] Romanian Acad, Res Inst Artificial Intelligence, Bucharest, Romania
Source
ROMANIAN JOURNAL OF INFORMATION SCIENCE AND TECHNOLOGY | 2023, Vol. 26, No. 3-4
Keywords
Convolutional neural networks; deep learning; deep neural networks; machine learning; speech emotion recognition; transfer learning;
DOI
10.59277/ROMJIST.2023.3-4.10
Chinese Library Classification
TP301 [Theory, Methods];
Discipline Classification Code
081202;
Abstract
Speech emotion recognition (SER) is the task of determining the affective content present in speech, a promising research area of great interest in recent years, with important applications in forensic speech and law enforcement operations, among other fields. In this paper, systems based on deep neural networks (DNNs) spanning five levels of complexity are proposed, developed, and tested, including systems leveraging transfer learning (TL) from top modern image recognition deep learning models, as well as several ensemble classification techniques that lead to significant performance increases. The systems were tested on the most relevant SER datasets (EMODB, CREMA-D, and IEMOCAP) in the context of (i) classification, using the standard full sets of emotion classes as well as additional negative-emotion subsets relevant to forensic speech applications, and (ii) regression, using the continuously valued 2D arousal-valence affect space. The proposed systems achieved state-of-the-art results on the full class set for EMODB (up to 83% accuracy) and performance comparable to other published research on the full class sets for CREMA-D and IEMOCAP (up to 55% and 62% accuracy, respectively). For the class subsets focusing only on negative affective content, the proposed solutions offered top performance versus previously published state-of-the-art results.
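The abstract mentions ensemble classification techniques combining several DNN systems. A common such technique is soft voting, which averages the per-class posterior probabilities of the individual classifiers and predicts the class with the highest mean score. The sketch below is a minimal illustration of that idea only; the probability values, model names, and four-class emotion layout are hypothetical and do not reproduce the paper's actual systems.

```python
import numpy as np

# Hypothetical per-model posteriors for 3 utterances over 4 emotion
# classes (e.g. anger, happiness, sadness, neutral). In the paper,
# these would come from the individual DNN/TL systems.
probs_model_a = np.array([[0.6, 0.2, 0.1, 0.1],
                          [0.2, 0.5, 0.2, 0.1],
                          [0.1, 0.1, 0.7, 0.1]])
probs_model_b = np.array([[0.5, 0.3, 0.1, 0.1],
                          [0.1, 0.6, 0.2, 0.1],
                          [0.2, 0.1, 0.6, 0.1]])

def soft_vote(*model_probs):
    """Average class posteriors across models, then take the argmax class."""
    avg = np.mean(np.stack(model_probs), axis=0)
    return avg.argmax(axis=1)

pred = soft_vote(probs_model_a, probs_model_b)
print(pred)  # one predicted class index per utterance
```

Soft voting tends to outperform hard (majority) voting when the base classifiers produce well-calibrated probabilities, since near-ties in one model can be resolved by confident predictions in another.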
Pages: 375-387
Page count: 13