Hybrid LSTM-Transformer Model for Emotion Recognition From Speech Audio Files

Cited by: 50
Authors
Andayani, Felicia [1 ]
Theng, Lau Bee [1 ]
Tsun, Mark Teekit [1 ]
Chua, Caslon [2 ]
Affiliations
[1] Swinburne Univ Technol, Fac Engn Comp & Sci, Sarawak Campus, Sarawak 93350, Malaysia
[2] Swinburne Univ Technol, Fac Sci Engn & Technol, Melbourne, Vic 3122, Australia
Source
IEEE ACCESS | 2022, Vol. 10
Keywords
Feature extraction; Speech recognition; Transformers; Emotion recognition; Task analysis; Convolutional neural networks; Spectrogram; Attention mechanism; long short-term memory network; speech emotion recognition; transformer encoder
DOI
10.1109/ACCESS.2022.3163856
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Emotion is a vital component of daily human communication, helping people understand one another, and recognizing it automatically is a crucial step in building human-computer interaction systems. In a nutshell, Speech Emotion Recognition (SER) identifies the emotional signals carried in human speech or daily conversation, and these emotions depend strongly on temporal information. Although much existing research has shown that hybrid systems outperform the traditional single classifiers used in SER, each approach still has limitations. This paper therefore proposes a hybrid of a Long Short-Term Memory (LSTM) network and a Transformer encoder to learn the long-term dependencies in speech signals and classify emotions. Speech features are extracted as Mel Frequency Cepstral Coefficients (MFCC) and fed into the proposed hybrid LSTM-Transformer classifier. A range of performance evaluations was conducted on the proposed model; the results indicate a significant recognition improvement over models reported in other published works. The proposed hybrid model reached 75.62%, 85.55%, and 72.49% recognition accuracy on the RAVDESS, Emo-DB, and language-independent datasets, respectively.
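As a minimal sketch of the pipeline the abstract describes, the example below extracts MFCC features and feeds them through an LSTM followed by a Transformer encoder. The hyperparameters (40 MFCC coefficients, hidden size 128, 4 attention heads, 2 encoder layers, 8 emotion classes), the mean-pooling step, the ordering of the two blocks, and the file name speech.wav are all illustrative assumptions; the paper's actual configuration is not given in the abstract.

```python
# Minimal MFCC -> LSTM -> Transformer-encoder sketch (hyperparameters assumed).
import librosa
import torch
import torch.nn as nn

class LSTMTransformer(nn.Module):
    def __init__(self, n_mfcc=40, hidden=128, n_heads=4, n_layers=2,
                 n_emotions=8):
        super().__init__()
        # The LSTM captures long-term temporal dependencies across speech frames.
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        # The Transformer encoder applies self-attention over the LSTM outputs.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(hidden, n_emotions)

    def forward(self, x):                          # x: (batch, frames, n_mfcc)
        out, _ = self.lstm(x)                      # (batch, frames, hidden)
        out = self.encoder(out)                    # self-attention over time
        return self.classifier(out.mean(dim=1))    # pool frames, then classify

# Extract MFCC features from one utterance and run a forward pass.
signal, sr = librosa.load("speech.wav", sr=16000)        # hypothetical file
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)  # (40, frames)
x = torch.tensor(mfcc.T, dtype=torch.float32).unsqueeze(0)  # (1, frames, 40)
logits = LSTMTransformer()(x)                            # (1, n_emotions)
```

In this sketch, stacking the encoder on the LSTM output rather than replacing the LSTM lets self-attention re-weight the recurrent features across the whole utterance, matching the abstract's stated goal of learning long-term dependencies in the speech signal.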
Pages: 36018-36027
Page count: 10