Multi-Stream Convolution-Recurrent Neural Networks Based on Attention Mechanism Fusion for Speech Emotion Recognition

Cited by: 13
Authors
Tao, Huawei [1 ]
Geng, Lei [1 ]
Shan, Shuai [1 ]
Mai, Jingchao [1 ]
Fu, Hongliang [1 ]
Affiliations
[1] Henan Univ Technol, Coll Informat Sci & Engn, Zhengzhou 450001, Peoples R China
Keywords
speech emotion recognition; feature extraction; hybrid neural network; multi-head attention mechanism; feature fusion; SPECTRAL FEATURES; MODEL;
DOI
10.3390/e24081025
Chinese Library Classification
O4 [Physics];
Subject Classification Code
0702;
Abstract
The quality of feature extraction plays a significant role in the performance of speech emotion recognition (SER). To extract discriminative, affect-salient features from speech signals and thereby improve SER performance, this paper proposes a multi-stream convolution-recurrent neural network based on an attention mechanism (MSCRNN-A). First, a multi-stream sub-branch fully convolutional network (MSFCN) based on AlexNet is presented to limit the loss of emotional information: sub-branches are added after each pooling layer to retain features at different resolutions, and the features from these sub-branches are fused by element-wise addition. Second, the MSFCN is combined with a Bi-LSTM network to form a hybrid network that extracts speech emotion features while supplying their temporal structure. Finally, a feature fusion model based on a multi-head attention mechanism is developed to obtain the best fused features: the attention mechanism computes the contribution degree of each network's features and then adaptively fuses them by weighting. To restrain gradient divergence in the network, the individual network features and the fused features are combined through a shortcut connection to obtain the final features used for recognition. Experimental results on three conventional SER corpora, CASIA, EMODB, and SAVEE, show that the proposed method significantly improves recognition performance, with a recognition rate superior to most existing state-of-the-art methods.
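The attention-weighted fusion with a shortcut connection described in the abstract can be sketched as follows. This is a hypothetical single-head simplification in NumPy (the paper uses a learned multi-head mechanism); `score_w` stands in for the learned attention parameters, and the shortcut is modeled as adding back the mean of the raw streams:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fusion(streams, score_w, shortcut=True):
    """Fuse several network feature streams by attention weighting.

    streams : (S, D) array, one D-dim feature vector per stream
              (e.g. MSFCN features and Bi-LSTM features)
    score_w : (D,) scoring vector, a stand-in for the learned
              attention parameters (hypothetical)
    Returns the fused D-dim feature and the per-stream weights.
    """
    scores = streams @ score_w      # contribution degree of each stream
    alpha = softmax(scores)         # normalized attention weights
    fused = alpha @ streams         # weighted sum of the streams
    if shortcut:
        # Shortcut connection: reconnect the raw stream features
        # to the fused features to help restrain gradient divergence.
        fused = fused + streams.mean(axis=0)
    return fused, alpha

streams = rng.standard_normal((3, 8))   # three toy feature streams
score_w = rng.standard_normal(8)
fused, alpha = attention_fusion(streams, score_w)
print(fused.shape, alpha)               # (8,) and weights summing to 1
```

In the full model this weighting would be computed per head and trained end-to-end; the sketch only illustrates how contribution degrees turn into an adaptive weighted fusion plus a residual path.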
Pages: 13