Audiovisual emotion recognition based on bi-layer LSTM and multi-head attention mechanism on RAVDESS dataset

Cited by: 1
Authors
Jin, Zeyu [1 ]
Zai, Wenjiao [1 ]
Affiliations
[1] Sichuan Normal Univ, Inst Technol, 1819,Sect 2,Chenglong Ave, Chengdu 610101, Sichuan, Peoples R China
Keywords
Attention mechanism; Audiovisual integration; Deep learning; Emotion recognition; Multimodal;
DOI
10.1007/s11227-024-06582-z
Chinese Library Classification (CLC)
TP3 [Computing technology; computer technology];
Discipline classification code
0812 ;
Abstract
Recognizing and classifying emotions is important in medicine, transportation, and artificial intelligence. This paper addresses the problem of poor emotion classification caused by inadequate fusion of multiple modalities. It proposes an audio-visual emotion recognition method based on a bilayer LSTM and a multi-head attention mechanism, evaluated on the RAVDESS dataset, which fuses speech and facial-expression features. The network learns MFCC features with a convolutional layer and facial features with a double-layer LSTM, then fuses the two modalities using a multi-head attention module. Finally, it convolves, pools, and concatenates the learned features for emotion recognition. In our experiments, the accuracy on the public RAVDESS dataset reaches 82.42% under 5-fold cross-validation. Comparison with other methods shows that the proposed method improves audio-visual emotion recognition.
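The fusion step described above can be illustrated with a minimal cross-modal multi-head attention sketch. This is not the authors' implementation: the dimensions, head count, and random weight matrices are stand-ins for learned projections, and the facial-feature sequence is used as the query attending to the MFCC sequence purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(query, key_value, num_heads, rng):
    """Cross-modal multi-head attention: `query` (e.g. facial features)
    attends to `key_value` (e.g. MFCC features). Weight matrices are
    random stand-ins for learned projections."""
    t_q, d = query.shape
    d_head = d // num_heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))

    def split(x, W):
        # project, then split into heads: (num_heads, time, d_head)
        return (x @ W).reshape(-1, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(query, Wq), split(key_value, Wk), split(key_value, Wv)
    # scaled dot-product attention per head: (num_heads, t_q, t_kv)
    scores = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head), axis=-1)
    # weighted sum of values, merge heads back to (t_q, d)
    fused = (scores @ V).transpose(1, 0, 2).reshape(t_q, d)
    return fused @ Wo

rng = np.random.default_rng(0)
visual = rng.standard_normal((30, 64))  # e.g. 30 frames of facial features
audio = rng.standard_normal((50, 64))   # e.g. 50 frames of MFCC features
fused = multi_head_attention(visual, audio, num_heads=4, rng=rng)
print(fused.shape)  # (30, 64): one fused vector per visual frame
```

The fused sequence would then feed the convolution, pooling, and concatenation stages before classification.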
Pages: 18
Related papers
36 records
[1]   Multimodal human emotion/expression recognition [J].
Chen, LS ;
Huang, TS ;
Miyasato, T ;
Nakatsu, R .
AUTOMATIC FACE AND GESTURE RECOGNITION - THIRD IEEE INTERNATIONAL CONFERENCE PROCEEDINGS, 1998, :366-371
[2]   Self-attention fusion for audiovisual emotion recognition with incomplete data [J].
Chumachenko, Kateryna ;
Iosifidis, Alexandros ;
Gabbouj, Moncef .
2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, :2822-2828
[3]  
De Silva LC, 1997, ICICS - PROCEEDINGS OF 1997 INTERNATIONAL CONFERENCE ON INFORMATION, COMMUNICATIONS AND SIGNAL PROCESSING, VOLS 1-3, P397, DOI 10.1109/ICICS.1997.647126
[4]  
Ekman P., 2013, Emotion in the human face: Guidelines for research and an integration of findings, V11
[5]   A Novel Approach for Classification of Speech Emotions Based on Deep and Acoustic Features [J].
Er, Mehmet Bilal .
IEEE ACCESS, 2020, 8 :221640-221653
[6]   On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues [J].
Eyben, Florian ;
Woellmer, Martin ;
Graves, Alex ;
Schuller, Bjoern ;
Douglas-Cowie, Ellen ;
Cowie, Roddy .
JOURNAL ON MULTIMODAL USER INTERFACES, 2010, 3 (1-2) :7-19
[7]  
Foo LS, 2020, 2020 11TH IEEE CONTROL AND SYSTEM GRADUATE RESEARCH COLLOQUIUM (ICSGRC), P26, DOI [10.1109/ICSGRC49013.2020.9232488, 10.1109/icsgrc49013.2020.9232488]
[8]  
Fu Z., 2021, arXiv, DOI DOI 10.48550/ARXIV.2111.02172
[9]  
Jahangir R, 2021, MULTIMED TOOLS APPL, V80, P23745, DOI 10.1007/s11042-020-09874-7
[10]   Emotion recognition from speech: a review [J].
Koolagudi, Shashidhar G. ;
Rao, K. Sreenivasa .
INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2012, 15 (02) :99-117