Hybrid LSTM-Transformer Model for Emotion Recognition From Speech Audio Files

Cited by: 50
Authors
Andayani, Felicia [1 ]
Theng, Lau Bee [1 ]
Tsun, Mark Teekit [1 ]
Chua, Caslon [2 ]
Affiliations
[1] Swinburne Univ Technol, Fac Engn Comp & Sci, Sarawak Campus, Sarawak 93350, Malaysia
[2] Swinburne Univ Technol, Fac Sci Engn & Technol, Melbourne, Vic 3122, Australia
Source
IEEE ACCESS | 2022, Vol. 10
Keywords
Feature extraction; Speech recognition; Transformers; Emotion recognition; Task analysis; Convolutional neural networks; Spectrogram; Attention mechanism; long short-term memory network; speech emotion recognition; transformer encoder
DOI
10.1109/ACCESS.2022.3163856
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Emotion is a vital component of daily human communication, helping people understand one another, and recognizing it automatically is a crucial step in building human-computer interaction systems. In a nutshell, Speech Emotion Recognition (SER) identifies the emotional signals carried in human speech or daily conversation, and these emotions depend strongly on temporal information. Although much existing research has shown that hybrid systems outperform the traditional single classifiers used in SER, each approach still has limitations. This paper therefore proposes a hybrid of a Long Short-Term Memory (LSTM) network and a Transformer encoder to learn the long-term dependencies in speech signals and classify emotions. Speech features are extracted as Mel Frequency Cepstral Coefficients (MFCC) and fed into the proposed hybrid LSTM-Transformer classifier. A range of performance evaluations was conducted on the proposed model; the results indicate a significant recognition improvement over models reported in other published works. The proposed hybrid model reached 75.62%, 85.55%, and 72.49% recognition accuracy on the RAVDESS, Emo-DB, and language-independent datasets, respectively.
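As a minimal sketch of the pipeline the abstract describes, the example below extracts MFCC features and feeds them through an LSTM followed by a Transformer encoder. The hyperparameters (40 MFCC coefficients, hidden size 128, 4 attention heads, 2 encoder layers, 8 emotion classes), the mean-pooling step, the ordering of the two blocks, and the file name speech.wav are all illustrative assumptions; the paper's actual configuration is not given in the abstract.

```python
# Minimal MFCC -> LSTM -> Transformer-encoder sketch (hyperparameters assumed).
import librosa
import torch
import torch.nn as nn

class LSTMTransformer(nn.Module):
    def __init__(self, n_mfcc=40, hidden=128, n_heads=4, n_layers=2,
                 n_emotions=8):
        super().__init__()
        # The LSTM captures long-term temporal dependencies across speech frames.
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        # The Transformer encoder applies self-attention over the LSTM outputs.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(hidden, n_emotions)

    def forward(self, x):                          # x: (batch, frames, n_mfcc)
        out, _ = self.lstm(x)                      # (batch, frames, hidden)
        out = self.encoder(out)                    # self-attention over time
        return self.classifier(out.mean(dim=1))    # pool frames, then classify

# Extract MFCC features from one utterance and run a forward pass.
signal, sr = librosa.load("speech.wav", sr=16000)        # hypothetical file
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)  # (40, frames)
x = torch.tensor(mfcc.T, dtype=torch.float32).unsqueeze(0)  # (1, frames, 40)
logits = LSTMTransformer()(x)                            # (1, n_emotions)
```

In this sketch, stacking the encoder on the LSTM output rather than replacing the LSTM lets self-attention re-weight the recurrent features across the whole utterance, matching the abstract's stated goal of learning long-term dependencies in the speech signal.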
Pages: 36018-36027
Page count: 10