Hybrid LSTM-Transformer Model for Emotion Recognition From Speech Audio Files

Cited by: 64
Authors
Andayani, Felicia [1 ]
Theng, Lau Bee [1 ]
Tsun, Mark Teekit [1 ]
Chua, Caslon [2 ]
Affiliations
[1] Swinburne Univ Technol, Fac Engn Comp & Sci, Sarawak Campus, Sarawak 93350, Malaysia
[2] Swinburne Univ Technol, Fac Sci Engn & Technol, Melbourne, Vic 3122, Australia
Source
IEEE ACCESS | 2022, Vol. 10
Keywords
Feature extraction; Speech recognition; Transformers; Emotion recognition; Task analysis; Convolutional neural networks; Spectrogram; Attention mechanism; long short-term memory network; speech emotion recognition; transformer encoder
DOI
10.1109/ACCESS.2022.3163856
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Emotion is a vital component of daily human communication, and it helps people understand one another. Emotion recognition therefore plays a crucial role in developing human-computer interaction and computer-based speech processing systems. In a nutshell, Speech Emotion Recognition (SER) identifies the emotional signals transmitted through human speech or daily conversation, where the emotions expressed in speech depend strongly on temporal information. Although much existing research has shown that hybrid systems outperform the traditional single classifiers used in SER, each approach has its own limitations. This paper therefore proposes a hybrid of a Long Short-Term Memory (LSTM) network and a Transformer encoder to learn the long-term dependencies in speech signals and classify emotions. Speech features are extracted as Mel Frequency Cepstral Coefficients (MFCCs) and fed into the proposed hybrid LSTM-Transformer classifier. A range of performance evaluations was conducted on the proposed model, and the results indicate a significant improvement in recognition over the models reported in other published works. The proposed hybrid model achieved 75.62%, 85.55%, and 72.49% recognition accuracy on the RAVDESS, Emo-DB, and language-independent datasets, respectively.
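The pipeline the abstract describes (MFCC features passed through an LSTM, whose outputs are re-encoded by Transformer self-attention and pooled for classification) can be sketched compactly. The following is a minimal illustration assuming PyTorch and librosa; the sampling rate, number of MFCCs, layer sizes, head count, and eight-class output are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of an MFCC -> LSTM -> Transformer-encoder SER pipeline.
# Assumes PyTorch and librosa; all hyperparameters below are illustrative.
import librosa
import torch
import torch.nn as nn

def extract_mfcc(path, n_mfcc=40):
    """Load an audio file and return its MFCCs shaped (time, n_mfcc)."""
    signal, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return torch.from_numpy(mfcc.T).float()  # transpose to (time, n_mfcc)

class LSTMTransformer(nn.Module):
    def __init__(self, n_mfcc=40, hidden=128, heads=4, layers=2, n_classes=8):
        super().__init__()
        # The LSTM models local temporal structure in the MFCC sequence.
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        # The Transformer encoder attends over the full LSTM output sequence
        # to capture long-term dependencies.
        enc_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, x):                # x: (batch, time, n_mfcc)
        out, _ = self.lstm(x)            # (batch, time, hidden)
        out = self.encoder(out)          # (batch, time, hidden)
        return self.classifier(out.mean(dim=1))  # pool over time -> logits

model = LSTMTransformer()
features = extract_mfcc("speech.wav").unsqueeze(0)  # hypothetical input file
logits = model(features)                 # (1, n_classes) emotion scores
```

Mean-pooling over time before the final linear layer is one common way to reduce the encoder's sequence output to a single utterance-level prediction; the paper's exact pooling and training details are not given in the abstract.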
Pages: 36018-36027
Page count: 10