Multimodal emotion recognition from facial expression and speech based on feature fusion

Cited by: 8
Authors
Tang, Guichen [1 ]
Xie, Yue [1 ]
Li, Ke [2 ]
Liang, Ruiyu [1 ]
Zhao, Li [2 ]
Affiliations
[1] Nanjing Inst Technol, Sch Informat & Commun Engn, Nanjing, Peoples R China
[2] Southeast Univ, Informat Sci & Engn, Nanjing, Peoples R China
Funding
National Natural Science Foundation of China;
关键词
Multimodal emotion recognition; Attention mechanism; Deep learning; Feature fusion;
DOI
10.1007/s11042-022-14185-0
CLC number (Chinese Library Classification)
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
Multimodal emotion recognition uses facial expression and speech information to identify an individual's emotional state. Feature fusion, which combines information from multiple modalities, is an important technique for multimodal emotion recognition; however, it raises problems of cross-modal synchronization and of overfitting caused by large feature dimensions. To address these issues, an attention mechanism is introduced so that the network automatically focuses on locally effective information; it is applied to both the audio-video feature fusion task and the temporal modeling task. The main contributions are as follows: 1) a multi-head self-attention mechanism is used to fuse audio and video features, avoiding the influence of prior information on the fusion result, and 2) a bidirectional gated recurrent unit is used to model the time series of the fused features, with the autocorrelation coefficient along the time dimension additionally computed as an attention weight for fusion. Experimental results show that the adopted attention mechanisms effectively improve the accuracy of multimodal emotion recognition.
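For concreteness, a minimal PyTorch sketch of the pipeline the abstract describes follows. All dimensions, layer choices, the class name FusionEmotionNet, and the exact form of the autocorrelation pooling are illustrative assumptions, not the authors' implementation.

# A hypothetical sketch, assuming per-frame audio/video features:
# multi-head self-attention fuses the two modalities, a bidirectional
# GRU models the fused sequence, and an autocorrelation-based weighting
# pools over time before classification.
import torch
import torch.nn as nn

class FusionEmotionNet(nn.Module):
    def __init__(self, audio_dim=128, video_dim=256, d_model=256,
                 n_heads=4, n_classes=7):
        super().__init__()
        # Project both modalities into a shared space before attention.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        # Multi-head self-attention over the concatenated token sequence,
        # so cross-modal weights are learned rather than fixed a priori.
        self.fusion_attn = nn.MultiheadAttention(d_model, n_heads,
                                                 batch_first=True)
        # Bidirectional GRU models the fused features over time.
        self.bigru = nn.GRU(d_model, d_model // 2, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, audio, video):
        # audio: (B, T, audio_dim); video: (B, T, video_dim)
        tokens = torch.cat([self.audio_proj(audio),
                            self.video_proj(video)], dim=1)  # (B, 2T, d)
        fused, _ = self.fusion_attn(tokens, tokens, tokens)
        h, _ = self.bigru(fused)                              # (B, 2T, d)
        # Autocorrelation along time used as attention: score each step
        # by its mean similarity to the whole sequence, then softmax-pool.
        sim = torch.einsum('btd,bsd->bts', h, h) / h.size(-1) ** 0.5
        weights = torch.softmax(sim.mean(dim=-1), dim=-1)     # (B, 2T)
        pooled = torch.einsum('bt,btd->bd', weights, h)
        return self.classifier(pooled)

if __name__ == "__main__":
    model = FusionEmotionNet()
    logits = model(torch.randn(2, 50, 128), torch.randn(2, 50, 256))
    print(logits.shape)  # torch.Size([2, 7])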
Pages: 16359-16373
Page count: 15