Audiovisual emotion recognition based on bi-layer LSTM and multi-head attention mechanism on RAVDESS dataset

Cited by: 1
Authors
Jin, Zeyu [1 ]
Zai, Wenjiao [1 ]
Affiliations
[1] Sichuan Normal Univ, Inst Technol, 1819,Sect 2,Chenglong Ave, Chengdu 610101, Sichuan, Peoples R China
Keywords
Attention mechanism; Audiovisual integration; Deep learning; Emotion recognition; Multimodal;
DOI
10.1007/s11227-024-06582-z
Chinese Library Classification (CLC)
TP3 [Computing technology; computer technology];
Discipline classification code
0812 ;
Abstract
Recognizing and classifying emotions is important in medicine, transportation, and artificial intelligence. This paper addresses the problem of poor emotion classification caused by inadequate fusion of multiple modalities. It proposes an audio-visual emotion recognition method based on a bilayer LSTM and a multi-head attention mechanism, evaluated on the RAVDESS dataset, which fuses speech and facial-expression features. The network learns MFCC features with a convolutional layer and facial features with a double-layer LSTM, then fuses the two modalities using a multi-head attention module. Finally, it convolves, pools, and concatenates the learned features for emotion recognition. In our experiments, the accuracy on the public RAVDESS dataset reaches 82.42% under 5-fold cross-validation. Comparison with other methods shows that the proposed method improves audio-visual emotion recognition.
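The fusion step described above can be illustrated with a minimal cross-modal multi-head attention sketch. This is not the authors' implementation: the dimensions, head count, and random weight matrices are stand-ins for learned projections, and the facial-feature sequence is used as the query attending to the MFCC sequence purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(query, key_value, num_heads, rng):
    """Cross-modal multi-head attention: `query` (e.g. facial features)
    attends to `key_value` (e.g. MFCC features). Weight matrices are
    random stand-ins for learned projections."""
    t_q, d = query.shape
    d_head = d // num_heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))

    def split(x, W):
        # project, then split into heads: (num_heads, time, d_head)
        return (x @ W).reshape(-1, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(query, Wq), split(key_value, Wk), split(key_value, Wv)
    # scaled dot-product attention per head: (num_heads, t_q, t_kv)
    scores = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head), axis=-1)
    # weighted sum of values, merge heads back to (t_q, d)
    fused = (scores @ V).transpose(1, 0, 2).reshape(t_q, d)
    return fused @ Wo

rng = np.random.default_rng(0)
visual = rng.standard_normal((30, 64))  # e.g. 30 frames of facial features
audio = rng.standard_normal((50, 64))   # e.g. 50 frames of MFCC features
fused = multi_head_attention(visual, audio, num_heads=4, rng=rng)
print(fused.shape)  # (30, 64): one fused vector per visual frame
```

The fused sequence would then feed the convolution, pooling, and concatenation stages before classification.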
Pages: 18
Related papers
36 records
[1]   Multimodal human emotion/expression recognition [J].
Chen, LS ;
Huang, TS ;
Miyasato, T ;
Nakatsu, R .
AUTOMATIC FACE AND GESTURE RECOGNITION - THIRD IEEE INTERNATIONAL CONFERENCE PROCEEDINGS, 1998, :366-371
[2]   Self-attention fusion for audiovisual emotion recognition with incomplete data [J].
Chumachenko, Kateryna ;
Iosifidis, Alexandros ;
Gabbouj, Moncef .
2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, :2822-2828
[3]  
De Silva LC, 1997, ICICS - PROCEEDINGS OF 1997 INTERNATIONAL CONFERENCE ON INFORMATION, COMMUNICATIONS AND SIGNAL PROCESSING, VOLS 1-3, P397, DOI 10.1109/ICICS.1997.647126
[4]  
Ekman P., 2013, Emotion in the human face: Guidelines for research and an integration of findings, V11
[5]   A Novel Approach for Classification of Speech Emotions Based on Deep and Acoustic Features [J].
Er, Mehmet Bilal .
IEEE ACCESS, 2020, 8 :221640-221653
[6]   On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues [J].
Eyben, Florian ;
Woellmer, Martin ;
Graves, Alex ;
Schuller, Bjoern ;
Douglas-Cowie, Ellen ;
Cowie, Roddy .
JOURNAL ON MULTIMODAL USER INTERFACES, 2010, 3 (1-2) :7-19
[7]  
Foo LS, 2020, 2020 11TH IEEE CONTROL AND SYSTEM GRADUATE RESEARCH COLLOQUIUM (ICSGRC), P26, DOI [10.1109/ICSGRC49013.2020.9232488, 10.1109/icsgrc49013.2020.9232488]
[8]  
Fu Z., 2021, arXiv, DOI DOI 10.48550/ARXIV.2111.02172
[9]  
Jahangir R, 2021, MULTIMED TOOLS APPL, V80, P23745, DOI 10.1007/s11042-020-09874-7
[10]   Emotion recognition from speech: a review [J].
Koolagudi, Shashidhar G. ;
Rao, K. Sreenivasa .
INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2012, 15 (02) :99-117