A Multi-scale Fusion Framework for Bimodal Speech Emotion Recognition

Cited by: 53
Authors
Chen, Ming [1 ,2 ]
Zhao, Xudong [2 ]
Affiliations
[1] Zhejiang Univ, 38 Zheda Rd, Hangzhou, Peoples R China
[2] Hithink RoyalFlush Informat Network Co Ltd, Hangzhou, Peoples R China
Source
INTERSPEECH 2020, 2020
Keywords
speech emotion recognition; bimodal; multi-scale fusion strategy; feature fusion; ensemble learning; features
DOI
10.21437/Interspeech.2020-3156
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline codes
100104; 100213
Abstract
Speech emotion recognition (SER) is a challenging task that requires learning suitable features to achieve good performance. The development of deep learning techniques makes it possible to extract features automatically rather than construct hand-crafted ones. In this paper, a multi-scale fusion framework named STSER is proposed for bimodal SER using speech and text information. A speech model (smodel), which combines a convolutional neural network (CNN), a bi-directional long short-term memory (Bi-LSTM) network and an attention mechanism, learns speech representations from the log-mel spectrogram extracted from the speech data. Specifically, the CNN layers capture local correlations, the Bi-LSTM layer then models long-term dependencies and contextual information, and the multi-head self-attention layer makes the model focus on the features most related to the emotions. A text model (tmodel) based on a pre-trained ALBERT model learns text representations from the text data. Finally, a multi-scale fusion strategy, combining feature fusion and ensemble learning, improves the overall performance. Experiments conducted on the public emotion dataset IEMOCAP show that the proposed STSER achieves comparable recognition accuracy with fewer feature inputs.
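As a rough illustration of two components the abstract highlights, the sketch below implements multi-head self-attention over a sequence of frame-level features (as produced by a Bi-LSTM) and simple score-level ensemble averaging. This is a minimal NumPy sketch, not the paper's implementation; the function names, weight shapes, and the choice of unweighted averaging are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, n_heads):
    """Multi-head scaled dot-product self-attention.

    X: (T, d) frame-level features, e.g. Bi-LSTM outputs over T frames.
    Wq, Wk, Wv: (d, d) projection matrices (hypothetical, randomly chosen here).
    Returns a (T, d) sequence where each frame attends to emotion-relevant frames.
    """
    T, d = X.shape
    dh = d // n_heads                       # per-head dimension
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        q = Q[:, h * dh:(h + 1) * dh]
        k = K[:, h * dh:(h + 1) * dh]
        v = V[:, h * dh:(h + 1) * dh]
        scores = softmax(q @ k.T / np.sqrt(dh))  # (T, T) attention weights
        heads.append(scores @ v)                 # (T, dh) per-head context
    return np.concatenate(heads, axis=-1)        # (T, d)

def ensemble_average(prob_list):
    """Score-level fusion: average the class posteriors of several models
    (e.g. smodel, tmodel, and a feature-fusion model)."""
    return np.mean(prob_list, axis=0)
```

In practice the attention output would be pooled over time into a fixed-length utterance vector before classification; here only the attention and ensemble steps are shown.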
Pages: 374-378
Page count: 5
References
23 in total
[1] Busso C., Bulut M., Lee C.-C., Kazemzadeh A., Mower E., Kim S., Chang J. N., Lee S., Narayanan S. S. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008, 42(4): 335-359.
[2] Chen W., Xie D., Zhang Y., Pu S. All You Need is a Few Shifts: Designing Efficient Convolutional Neural Networks for Image Classification. CVPR 2019: 7234-7243.
[3] Cho J., Pappagari R., Kulkarni P., Villalba J., Carmiel Y., Dehak N. Deep neural networks for emotion recognition combining audio and transcripts. Interspeech 2018: 247-251.
[4] Demircan S. Journal of Advances in Computer Networks, 2014, 2: 34. DOI: 10.7763/JACN.2014.V2.76.
[5] El Ayadi M., Kamel M. S., Karray F. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 2011, 44(3): 572-587.
[6] Heusser V. CoRR, 2019.
[7] Hifny Y. ICASSP 2019: 6710. DOI: 10.1109/ICASSP.2019.8683632.
[8] Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019 (Long and Short Papers): 4171-4186.
[9] Kim J. W., Saurous R. A. Emotion Recognition from Human Speech Using Temporal Information and Deep Learning. Interspeech 2018: 937-940.
[10] Lan Z., et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv preprint, 2019.