FRONTEND ATTRIBUTES DISENTANGLEMENT FOR SPEECH EMOTION RECOGNITION

Cited by: 5
Authors
Xi, Yu-Xuan [1 ]
Song, Yan [1 ]
Dai, Li-Rong [1 ]
McLoughlin, Ian [1 ,2 ]
Liu, Lin [3 ]
Affiliations
[1] Univ Sci & Technol China, Natl Engn Lab Speech & Language Informat Proc, Hefei, Peoples R China
[2] Singapore Inst Technol, ICT Cluster, Singapore, Singapore
[3] iFLYTEK CO LTD, iFLYTEK Res, Hefei, Peoples R China
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
speech emotion recognition; convolutional neural network; style transformation; disentanglement;
DOI
10.1109/ICASSP43922.2022.9746691
CLC Classification
O42 [Acoustics];
Subject Classification
070206; 082403;
Abstract
Speech emotion recognition (SER) with a limited-size dataset is a challenging task, since a spoken utterance contains various disturbing attributes besides emotion, including speaker, content, and language. However, due to the close relationship between speaker and emotion attributes, simply fine-tuning a linear model on utterance-level embeddings (i.e., i-vectors and x-vectors) extracted from pre-trained speaker recognition (SR) frontends is enough to obtain good SER performance. In this paper, we aim to perform frontend attributes disentanglement (AD) for the SER task, using a pre-trained SR model. Specifically, the AD module consists of attribute normalization (AN) and attribute reconstruction (AR) phases. The AN phase filters out variation information using instance normalization (IN), and the AR phase reconstructs emotion-relevant features from the residual space to ensure high emotion discrimination. For better disentanglement, a dual space loss is designed to encourage the separability of the emotion-relevant and emotion-irrelevant spaces. To introduce long-range contextual information for emotion-related reconstruction, a time-frequency (TF) attention is further proposed. Unlike style disentanglement of extracted x-vectors, the proposed AD module can be applied to the frontend feature extractor. Experiments on the IEMOCAP benchmark demonstrate the effectiveness of the proposed method.
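To make the abstract's module concrete, the following is a minimal PyTorch-style sketch of the AD idea as described above: AN removes per-utterance variation with instance normalization, AR reconstructs emotion-relevant cues from the residual that IN discarded, a TF attention injects pooled time and frequency context, and a dual space loss pushes the two spaces apart. All class names, layer choices, and the exact loss (a cosine-similarity penalty) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TFAttention(nn.Module):
    # Hypothetical time-frequency attention: strip-style pooling along the
    # time and frequency axes supplies long-range context (an assumption;
    # the paper defines its own TF attention).
    def __init__(self, channels):
        super().__init__()
        self.conv_t = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_f = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        # x: (batch, channels, freq, time) frontend feature maps
        t_ctx = x.mean(dim=2, keepdim=True)   # pool over freq -> (B, C, 1, T)
        f_ctx = x.mean(dim=3, keepdim=True)   # pool over time -> (B, C, F, 1)
        attn = torch.sigmoid(self.conv_t(t_ctx) + self.conv_f(f_ctx))
        return x * attn                       # broadcasts to (B, C, F, T)

class AttributeDisentangle(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # AN: instance normalization filters per-utterance variation
        # (speaker/channel "style") from the feature maps.
        self.inorm = nn.InstanceNorm2d(channels, affine=False)
        # AR: recover emotion-relevant cues from the residual space;
        # the 1x1 conv + TF attention design is a guess, not the paper's.
        self.recon = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            TFAttention(channels),
        )

    def forward(self, x):
        normed = self.inorm(x)        # AN output: variation removed
        residual = x - normed         # emotion-irrelevant space dropped by IN
        recon = self.recon(residual)  # AR: reconstruct emotion-relevant part
        return normed + recon, normed, residual

def dual_space_loss(normed, residual):
    # One plausible dual space loss: penalize cosine similarity between
    # pooled embeddings of the two spaces so they stay separable
    # (an assumption, not the paper's exact formulation).
    e = F.normalize(normed.mean(dim=(2, 3)), dim=1)
    s = F.normalize(residual.mean(dim=(2, 3)), dim=1)
    return (e * s).sum(dim=1).abs().mean()

if __name__ == "__main__":
    ad = AttributeDisentangle(channels=64)
    feats = torch.randn(8, 64, 40, 200)       # dummy (B, C, F, T) features
    out, normed, residual = ad(feats)
    print(out.shape, dual_space_loss(normed, residual).item())

In this sketch the disentangled output (normed + recon) would feed the rest of the SR frontend, while dual_space_loss would be added to the emotion classification objective during fine-tuning.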
Pages: 7712-7716
Number of pages: 5