Deep learning based multimodal emotion recognition using model-level fusion of audio-visual modalities

Cited: 0
Authors
Middya, Asif Iqbal [1 ]
Nag, Baibhav [2 ]
Roy, Sarbani [1 ]
Affiliations
[1] Jadavpur Univ, Dept Comp Sci & Engn, Kolkata, India
[2] Jadavpur Univ, Dept Math, Kolkata, India
Keywords
Multimodal emotion recognition; Audio features; Video features; Classification; Deep learning;
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Emotion identification based on multimodal data (e.g., audio, video, text) is one of the most demanding and important research fields, with a wide range of applications. In this context, this work conducts a rigorous exploration of model-level fusion to find the optimal multimodal model for emotion recognition using audio and video modalities. More specifically, separate novel feature extractor networks for audio and video data are proposed. An optimal multimodal emotion recognition model is then created by fusing the audio and video features at the model level. The performance of the proposed models is assessed on two benchmark multimodal datasets, namely the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and Surrey Audio-Visual Expressed Emotion (SAVEE), using various performance metrics. The proposed models achieve high predictive accuracies of 99% and 86% on the SAVEE and RAVDESS datasets, respectively. The effectiveness of the models is also verified by comparing their performance with that of existing emotion recognition models. Case studies are also conducted to explore the models' ability to capture the variability of the emotional states of speakers in publicly available real-world audio-visual media. (c) 2022 Elsevier B.V. All rights reserved.
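The model-level fusion the abstract describes, in which separate unimodal feature extractors produce embeddings that are joined before a shared classifier head, can be sketched as follows. This is a minimal NumPy illustration only; the extractor functions, embedding sizes (128-dim audio, 256-dim video), and input shapes are assumptions for the sketch, not the paper's actual networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def audio_feature_extractor(waveform):
    # Stand-in for an audio subnetwork: map a raw signal to a
    # fixed-size embedding (128-dim here, an assumed size).
    return np.tanh(waveform[:1280].reshape(128, 10).mean(axis=1))

def video_feature_extractor(frames):
    # Stand-in for a video subnetwork: pool per-frame features
    # into a fixed-size embedding (256-dim here, an assumed size).
    return np.tanh(frames.reshape(len(frames), -1)[:, :256].mean(axis=0))

def model_level_fusion(audio_emb, video_emb):
    # Model-level fusion: concatenate the two unimodal embeddings
    # into one vector that a shared classifier head would consume.
    return np.concatenate([audio_emb, video_emb])

waveform = rng.standard_normal(16000)       # ~1 s of 16 kHz audio (assumed)
frames = rng.standard_normal((30, 32, 32))  # 30 small video frames (assumed)

fused = model_level_fusion(audio_feature_extractor(waveform),
                           video_feature_extractor(frames))
print(fused.shape)  # (384,) -- joint representation fed to the classifier
```

The design point is that fusion happens on learned intermediate representations rather than on raw inputs (early fusion) or on per-modality predictions (decision-level fusion), so the downstream classifier can model cross-modal interactions.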
Pages: 14