Autoencoder With Emotion Embedding for Speech Emotion Recognition

Cited by: 29
Authors
Zhang, Chenghao [1]
Xue, Lei [1]
Affiliation
[1] Shanghai Univ, Sch Commun & Informat Engn, Shanghai 200444, Peoples R China
Keywords
Feature extraction; Speech recognition; Emotion recognition; Spectrogram; Noise reduction; Hidden Markov models; Acoustics; Speech emotion recognition; autoencoder; emotion embedding; instance normalization; GENERATION;
DOI
10.1109/ACCESS.2021.3069818
Chinese Library Classification (CLC)
TP [Automation technology; computer technology]
Discipline code
0812
Abstract
An important component of the human-computer interaction process is speech emotion recognition (SER), which has received growing attention in recent years. However, although a wide variety of methods has been proposed for SER, performance remains limited. A key reason for the low performance of SER systems is the difficulty of effectively extracting emotion-oriented features. In this paper, we propose a novel algorithm, an autoencoder with emotion embedding, to extract deep emotion features. Unlike many previous works, our model uses instance normalization, a technique common in the style-transfer field, rather than batch normalization. Furthermore, the emotion embedding path in our method leads the autoencoder to efficiently learn prior knowledge from the labels, enabling the model to distinguish which features are most related to human emotion. We concatenate the latent representation learned by the autoencoder with acoustic features obtained by the openSMILE toolkit, and the concatenated feature vector is then used for emotion classification. To improve the generalization of our method, a simple data augmentation approach is applied. Two publicly available and widely used databases, IEMOCAP and EMODB, are chosen to evaluate our method. Experimental results demonstrate that the proposed model achieves significant performance improvements over other speech emotion recognition systems.
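The abstract's key design choice is replacing batch normalization with instance normalization, which removes per-utterance channel statistics (e.g. speaker or recording-condition bias) rather than batch-level statistics. The following is a minimal NumPy sketch of that distinction for spectrogram-like features; it is an illustration of the general technique, not the authors' implementation, and the tensor shape `(batch, channels, frames)` is an assumption.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each channel of each utterance over its own time axis.

    x: array of shape (batch, channels, frames), e.g. mel-spectrogram features.
    Per-utterance statistics (speaker/recording bias) are removed, which is
    why this normalization is popular in style transfer.
    """
    mean = x.mean(axis=-1, keepdims=True)          # one mean per (utterance, channel)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    """Normalize each channel over the whole batch and time axis (inference-style,
    using the current batch statistics). Utterance-specific offsets survive."""
    mean = x.mean(axis=(0, -1), keepdims=True)     # one mean per channel
    var = x.var(axis=(0, -1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Example: 4 utterances, 8 feature channels, 100 frames,
# with a synthetic per-utterance offset standing in for speaker bias.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 100)) + rng.normal(size=(4, 1, 1)) * 5.0

y_in = instance_norm(x)   # every (utterance, channel) slice: ~zero mean, unit variance
y_bn = batch_norm(x)      # per-utterance offsets remain in the output
```

After `instance_norm`, each utterance's channels are zero-mean and unit-variance regardless of the speaker offset, whereas `batch_norm` only centers each channel across the batch, leaving the per-utterance shifts intact.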
Pages: 51231-51241
Page count: 11