Evaluating deep learning architectures for Speech Emotion Recognition

Cited by: 356
Authors
Fayek, Haytham M. [1]
Lech, Margaret [1]
Cavedon, Lawrence [2]
Affiliations
[1] RMIT Univ, Sch Engn, Melbourne, Vic 3001, Australia
[2] RMIT Univ, Sch Sci, Melbourne, Vic 3001, Australia
Keywords
Affective computing; Deep learning; Emotion recognition; Neural networks; Speech recognition; NEURAL-NETWORKS; FEATURES
DOI
10.1016/j.neunet.2017.02.013
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Speech Emotion Recognition (SER) can be regarded as a static or dynamic classification problem, which makes SER an excellent test bed for investigating and comparing various deep learning architectures. We describe a frame-based formulation to SER that relies on minimal speech processing and end-to-end deep learning to model intra-utterance dynamics. We use the proposed SER system to empirically explore feed-forward and recurrent neural network architectures and their variants. Experiments conducted illuminate the advantages and limitations of these architectures in paralinguistic speech recognition and emotion recognition in particular. As a result of our exploration, we report state-of-the-art results on the IEMOCAP database for speaker-independent SER and present quantitative and qualitative assessments of the models' performances. (C) 2017 Elsevier Ltd. All rights reserved.
Pages: 60-68
Number of pages: 9