End-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network

Cited by: 19
Authors
Tang, Duowei [1 ]
Kuppens, Peter [2 ]
Geurts, Luc [1 ,3 ]
van Waterschoot, Toon [1 ]
Affiliations
[1] Katholieke Univ Leuven, STADIUS Ctr Dynam Syst Signal Proc & Data Analyt, Dept Elect Engn ESAT, Kasteelpk Arenberg 10, B-3001 Leuven, Belgium
[2] Katholieke Univ Leuven, Fac Psychol & Educ Sci, Dekenstr 2, B-3000 Leuven, Belgium
[3] Katholieke Univ Leuven, E Media Res Lab, Andreas Vesaliusstr 13, B-3000 Leuven, Belgium
Funding
European Research Council;
Keywords
End-to-end learning; Speech emotion recognition; Dilated causal convolution; Context stacking;
DOI
10.1186/s13636-021-00208-5
Chinese Library Classification (CLC)
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Amongst the various characteristics of a speech signal, the expression of emotion is one that exhibits the slowest temporal dynamics. Hence, a performant speech emotion recognition (SER) system requires a predictive model capable of learning sufficiently long temporal dependencies in the analysed speech signal. In this work, we therefore propose a novel end-to-end neural network architecture based on the concept of dilated causal convolution with context stacking. Firstly, the proposed model consists only of parallelisable layers and is hence suited to parallel processing, avoiding the inherent lack of parallelisability of recurrent neural network (RNN) layers. Secondly, the design of a dedicated dilated causal convolution block allows the model to have a receptive field as large as the input sequence length, while maintaining a reasonably low computational cost. Thirdly, by introducing a context stacking structure, the proposed model is capable of exploiting long-term temporal dependencies, hence providing an alternative to the use of RNN layers. We evaluate the proposed model on SER regression and classification tasks and provide a comparison with a state-of-the-art end-to-end SER model. Experimental results indicate that the proposed model requires only one third of the number of model parameters used in the state-of-the-art model, while also significantly improving SER performance. Further experiments are reported to understand the impact of different input representations (i.e. raw audio samples vs. log mel-spectrograms) and to illustrate the benefits of an end-to-end approach over the use of hand-crafted audio features. Moreover, we show that the proposed model can efficiently learn intermediate embeddings that preserve speech emotion information.
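The abstract describes the architecture only at a high level. As a rough illustration of the two named ideas, a dilated causal convolution block with an exponentially growing receptive field and a context stacking structure that combines a fine-grained stream with a coarser long-range stream, the following PyTorch sketch may help. All class names, layer sizes, the pooling factor, and the output head are assumptions made for illustration only and do not reproduce the authors' exact design (see DOI 10.1186/s13636-021-00208-5 for the actual model).

# Minimal sketch (assumptions, not the published architecture): a dilated
# causal convolution stack whose receptive field grows exponentially with
# depth, plus a simple "context stacking" step that summarises past frames
# into a slower-rate context stream.
import torch
import torch.nn as nn


class CausalConv1d(nn.Module):
    """1-D convolution that only looks at past samples (left padding)."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                         # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))   # pad on the left only
        return self.conv(x)


class DilatedCausalBlock(nn.Module):
    """Stack of dilated causal convolutions with dilations 1, 2, 4, ..."""
    def __init__(self, channels, kernel_size=2, num_layers=8):
        super().__init__()
        self.layers = nn.ModuleList(
            [CausalConv1d(channels, channels, kernel_size, dilation=2 ** i)
             for i in range(num_layers)]
        )

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x)) + x          # residual connection
        return x


class ContextStackingSER(nn.Module):
    """Toy end-to-end SER model: local block + downsampled context block."""
    def __init__(self, in_ch=1, channels=64, num_outputs=1):
        super().__init__()
        self.frontend = nn.Conv1d(in_ch, channels, kernel_size=1)
        self.local = DilatedCausalBlock(channels)
        self.context = DilatedCausalBlock(channels)
        self.pool = nn.AvgPool1d(kernel_size=8)   # coarser time scale
        self.head = nn.Linear(2 * channels, num_outputs)

    def forward(self, x):                         # x: (batch, 1, samples)
        h = self.local(self.frontend(x))
        c = self.context(self.pool(h))            # long-range context stream
        # stack the last local frame with the last context frame
        feat = torch.cat([h[:, :, -1], c[:, :, -1]], dim=1)
        return self.head(feat)                    # e.g. an arousal/valence score


if __name__ == "__main__":
    model = ContextStackingSER()
    audio = torch.randn(2, 1, 16000)              # two 1-second 16 kHz clips
    print(model(audio).shape)                     # torch.Size([2, 1])

With kernel size 2 and dilations 1, 2, 4, ..., 2^(L-1), such a block has a receptive field of 2^L samples, which is why a modest number of layers can cover an input as long as the whole utterance at low computational cost.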
Pages: 16