SPEECH EMOTION RECOGNITION WITH COMPLEMENTARY ACOUSTIC REPRESENTATIONS

Cited by: 2
Authors
Zhang, Xiaoming [1 ]
Zhang, Fan [2 ]
Cui, Xiaodong [3 ]
Zhang, Wei [4 ]
Affiliations
[1] Nanjing Tech University, Nanjing, Jiangsu, China
[2] IBM Data & AI, Armonk, NY, USA
[3] IBM Research AI, Albany, NY, USA
[4] Wayfair AI, Boston, MA, USA
Source
2022 IEEE Spoken Language Technology Workshop (SLT), 2022
Keywords
speech emotion recognition; complementary acoustic representations; convolutional neural network; Transformer; embedding fusion
DOI
10.1109/SLT54892.2023.10023133
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Since CNNs excel at capturing local features while Transformers model long-range dependencies, we explore both architectures as encoders of acoustic representations in a parallel framework for speech emotion recognition. We feed logMel spectrograms to the CNN encoder and MFCCs to the Transformer encoder. The complementary acoustic representations generated by the two encoders are then fused to predict the frequency distribution of emotions. To further improve performance, we apply data augmentation based on vocal tract length perturbation and pretrain the Transformer encoder. The proposed framework is evaluated under the speaker-independent (SI) setting on the improvisation part of the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. Our weighted and unweighted accuracies reach 81.6% and 79.8%, respectively. To the best of our knowledge, this is the state-of-the-art result reported so far on this dataset in the SI scenario.
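The dual-encoder design described in the abstract can be made concrete with a short sketch. Below is a minimal PyTorch illustration of the parallel framework: a logMel spectrogram feeds a CNN encoder, an MFCC sequence feeds a Transformer encoder, and the two embeddings are concatenated and classified into an emotion distribution. All layer sizes, pooling choices, and module names (CNNEncoder, TransformerEncoder, DualEncoderSER) are illustrative assumptions for exposition, not the authors' published configuration.

# Minimal sketch of the parallel CNN/Transformer fusion idea.
# Hyperparameters and module names are assumptions, not the paper's setup.
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """Encodes a logMel spectrogram (batch, 1, n_mels, frames) into one vector."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pool over time and frequency
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.conv(x).flatten(1))

class TransformerEncoder(nn.Module):
    """Encodes an MFCC sequence (batch, frames, n_mfcc) into one vector."""
    def __init__(self, n_mfcc: int = 40, embed_dim: int = 128, n_layers: int = 2):
        super().__init__()
        self.input_proj = nn.Linear(n_mfcc, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.input_proj(x))
        return h.mean(dim=1)  # mean-pool over frames

class DualEncoderSER(nn.Module):
    """Fuses the two complementary embeddings and predicts an emotion distribution."""
    def __init__(self, n_emotions: int = 4, embed_dim: int = 128):
        super().__init__()
        self.cnn = CNNEncoder(embed_dim)
        self.transformer = TransformerEncoder(embed_dim=embed_dim)
        self.classifier = nn.Linear(2 * embed_dim, n_emotions)

    def forward(self, logmel: torch.Tensor, mfcc: torch.Tensor) -> torch.Tensor:
        # Concatenation is one simple fusion choice; the paper's exact
        # fusion mechanism may differ.
        fused = torch.cat([self.cnn(logmel), self.transformer(mfcc)], dim=-1)
        return self.classifier(fused).softmax(dim=-1)

# Toy usage: batch of 2 utterances, 80 mel bins / 40 MFCCs, 300 frames.
model = DualEncoderSER()
probs = model(torch.randn(2, 1, 80, 300), torch.randn(2, 300, 40))
print(probs.shape)  # torch.Size([2, 4])

The two input views are deliberately different: logMels preserve the local time-frequency texture that convolutions exploit, while the more decorrelated MFCC frames suit a sequence model attending over the whole utterance, which is what makes the fused representations complementary.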
Pages: 846-852
Page count: 7
References (24 total)
[1] Burkhardt, F. INTERSPEECH, 2005: 1517. DOI: 10.21437/INTERSPEECH.2005-446.
[2] Busso, Carlos; Bulut, Murtaza; Lee, Chi-Chun; Kazemzadeh, Abe; Mower, Emily; Kim, Samuel; Chang, Jeannette N.; Lee, Sungbok; Narayanan, Shrikanth S. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008, 42(4): 335-359.
[3] Gat, Itai; Aronowitz, Hagai; Zhu, Weizhong; Morais, Edmilson; Hoory, Ron. Speaker Normalization for Self-Supervised Speech Emotion Recognition. 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022: 7342-7346.
[4] Gillioz, Anthony. 2020 15th Conference on Computer Science and Information Systems (FedCSIS), 2020: 179. DOI: 10.15439/2020F20.
[5] Girdhar, Rohit; Carreira, Joao; Doersch, Carl; Zisserman, Andrew. Video Action Transformer Network. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019: 244-253.
[6] Gulati, Anmol; Qin, James; Chiu, Chung-Cheng; Parmar, Niki; Zhang, Yu; Yu, Jiahui; Han, Wei; Wang, Shibo; Zhang, Zhengdong; Wu, Yonghui; Pang, Ruoming. Conformer: Convolution-augmented Transformer for Speech Recognition. INTERSPEECH, 2020: 5036-5040.
[7] Hinton, G. E. Proc. ICML Workshop on Deep Learning, 2013.
[8] Jackson, Philip. Surrey Audio-Visual Expressed Emotion (SAVEE) database, 2011.
[9] Li, Pengcheng; Song, Yan; McLoughlin, Ian; Guo, Wu; Dai, Lirong. An Attention Pooling based Representation Learning Method for Speech Emotion Recognition. INTERSPEECH, 2018: 3087-3091.
[10] Lian, Zheng; Liu, Bin; Tao, Jianhua. CTNet: Conversational Transformer Network for Emotion Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 985-1000.