Multimodal emotion recognition from facial expression and speech based on feature fusion

Cited by: 8
Authors
Tang, Guichen [1 ]
Xie, Yue [1 ]
Li, Ke [2 ]
Liang, Ruiyu [1 ]
Zhao, Li [2 ]
Affiliations
[1] Nanjing Inst Technol, Sch Informat & Commun Engn, Nanjing, Peoples R China
[2] Southeast Univ, Informat Sci & Engn, Nanjing, Peoples R China
Funding
National Natural Science Foundation of China;
关键词
Multimodal emotion recognition; Attention mechanism; Deep learning; Feature fusion;
DOI
10.1007/s11042-022-14185-0
CLC number (Chinese Library Classification)
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
Multimodal emotion recognition uses facial expression and speech information to identify an individual's emotional state. Feature fusion, which combines information from multiple modalities, is an important technique for multimodal emotion recognition; however, it raises problems of cross-modal synchronization and of overfitting caused by large feature dimensions. To address these issues, an attention mechanism is introduced so that the network automatically focuses on locally effective information; it is applied to both the audio-video feature fusion task and the temporal modeling task. The main contributions are as follows: 1) a multi-head self-attention mechanism is used to fuse audio and video features, avoiding the influence of prior information on the fusion result, and 2) a bidirectional gated recurrent unit is used to model the time series of the fused features, with the autocorrelation coefficient along the time dimension additionally computed as an attention weight for fusion. Experimental results show that the adopted attention mechanisms effectively improve the accuracy of multimodal emotion recognition.
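For concreteness, a minimal PyTorch sketch of the pipeline the abstract describes follows. All dimensions, layer choices, the class name FusionEmotionNet, and the exact form of the autocorrelation pooling are illustrative assumptions, not the authors' implementation.

# A hypothetical sketch, assuming per-frame audio/video features:
# multi-head self-attention fuses the two modalities, a bidirectional
# GRU models the fused sequence, and an autocorrelation-based weighting
# pools over time before classification.
import torch
import torch.nn as nn

class FusionEmotionNet(nn.Module):
    def __init__(self, audio_dim=128, video_dim=256, d_model=256,
                 n_heads=4, n_classes=7):
        super().__init__()
        # Project both modalities into a shared space before attention.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        # Multi-head self-attention over the concatenated token sequence,
        # so cross-modal weights are learned rather than fixed a priori.
        self.fusion_attn = nn.MultiheadAttention(d_model, n_heads,
                                                 batch_first=True)
        # Bidirectional GRU models the fused features over time.
        self.bigru = nn.GRU(d_model, d_model // 2, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, audio, video):
        # audio: (B, T, audio_dim); video: (B, T, video_dim)
        tokens = torch.cat([self.audio_proj(audio),
                            self.video_proj(video)], dim=1)  # (B, 2T, d)
        fused, _ = self.fusion_attn(tokens, tokens, tokens)
        h, _ = self.bigru(fused)                              # (B, 2T, d)
        # Autocorrelation along time used as attention: score each step
        # by its mean similarity to the whole sequence, then softmax-pool.
        sim = torch.einsum('btd,bsd->bts', h, h) / h.size(-1) ** 0.5
        weights = torch.softmax(sim.mean(dim=-1), dim=-1)     # (B, 2T)
        pooled = torch.einsum('bt,btd->bd', weights, h)
        return self.classifier(pooled)

if __name__ == "__main__":
    model = FusionEmotionNet()
    logits = model(torch.randn(2, 50, 128), torch.randn(2, 50, 256))
    print(logits.shape)  # torch.Size([2, 7])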
Pages: 16359-16373
Page count: 15