HIERARCHICAL NETWORK BASED ON THE FUSION OF STATIC AND DYNAMIC FEATURES FOR SPEECH EMOTION RECOGNITION

Cited by: 23
Authors
Cao, Qi [1 ]
Hou, Mixiao [1 ]
Chen, Bingzhi [1 ]
Zhang, Zheng [1 ]
Lu, Guangming [1 ]
Affiliations
[1] Harbin Institute of Technology, Shenzhen, China
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021
Keywords
Speech Emotion Recognition; Static Features; Dynamic Features; Hierarchical Network
DOI
10.1109/ICASSP39728.2021.9414540
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Many studies on automatic speech emotion recognition (SER) have been devoted to extracting meaningful emotional features for generating emotion-relevant representations. However, they generally ignore the complementary learning of static and dynamic features, which limits performance. In this paper, we propose a novel hierarchical network, HNSD, that efficiently integrates static and dynamic features for SER. The proposed HNSD framework consists of three modules. To capture discriminative features, an encoding module is first designed to encode the static and dynamic features simultaneously. Taking the resulting features as inputs, a Gated Multi-features Unit (GMU) then explicitly determines intermediate emotional representations for frame-level feature fusion, rather than fusing the raw acoustic features directly. In this way, the learned static and dynamic features jointly generate unified feature representations. Finally, a classification module with a well-designed attention mechanism predicts emotional states at the utterance level. Extensive experiments on the IEMOCAP benchmark dataset demonstrate the superiority of our method over state-of-the-art baselines.
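The abstract outlines a three-stage pipeline: parallel encoders for static and dynamic frame-level features, a GMU that fuses the two streams frame by frame, and attention-based pooling for utterance-level classification. The PyTorch sketch below illustrates that flow in miniature; every module name, layer choice, dimension, and the delta-based derivation of the dynamic stream are assumptions made for illustration, not the authors' HNSD implementation.

```python
# Minimal sketch of gated static/dynamic fusion with attentive pooling.
# All sizes and module choices are illustrative assumptions, not HNSD itself.
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Gated unit mixing static and dynamic frame-level features.

    In the spirit of a Gated Multimodal Unit: a learned sigmoid gate
    decides, per frame and per dimension, how much of each stream to keep.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h_static: torch.Tensor, h_dynamic: torch.Tensor) -> torch.Tensor:
        # h_static, h_dynamic: (batch, frames, dim)
        z = torch.sigmoid(self.gate(torch.cat([h_static, h_dynamic], dim=-1)))
        return z * h_static + (1.0 - z) * h_dynamic


class AttentivePooling(nn.Module):
    """Collapses frame-level features into one utterance-level vector
    using a learned attention weight per frame."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, frames, dim) -> (batch, dim)
        alpha = torch.softmax(self.score(h), dim=1)  # attention over frames
        return (alpha * h).sum(dim=1)


class ToySER(nn.Module):
    """Encoders -> gated fusion -> attentive pooling -> classifier."""

    def __init__(self, n_feats: int = 40, dim: int = 128, n_classes: int = 4):
        super().__init__()
        self.enc_static = nn.GRU(n_feats, dim, batch_first=True)
        self.enc_dynamic = nn.GRU(n_feats, dim, batch_first=True)
        self.fusion = GatedFusion(dim)
        self.pool = AttentivePooling(dim)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, static_feats: torch.Tensor) -> torch.Tensor:
        # Dynamic features approximated here as first-order frame deltas
        # of the static ones (zero delta at the first frame).
        dynamic_feats = torch.diff(
            static_feats, dim=1, prepend=static_feats[:, :1, :]
        )
        h_s, _ = self.enc_static(static_feats)
        h_d, _ = self.enc_dynamic(dynamic_feats)
        fused = self.fusion(h_s, h_d)
        return self.classifier(self.pool(fused))


if __name__ == "__main__":
    model = ToySER()
    x = torch.randn(2, 300, 40)  # 2 utterances, 300 frames, 40 log-Mel bins
    print(model(x).shape)        # torch.Size([2, 4]) -> 4 emotion logits
```

The sigmoid gate mixes the two streams per frame and per dimension before any pooling, which mirrors the paper's stated idea of determining intermediate emotional representations for fusion rather than concatenating the acoustic features directly.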
Pages: 6334-6338
Page count: 5