Deep learning based multimodal emotion recognition using model-level fusion of audio-visual modalities

Cited: 0
Authors
Middya, Asif Iqbal [1 ]
Nag, Baibhav [2 ]
Roy, Sarbani [1 ]
Affiliations
[1] Jadavpur Univ, Dept Comp Sci & Engn, Kolkata, India
[2] Jadavpur Univ, Dept Math, Kolkata, India
Keywords
Multimodal emotion recognition; Audio features; Video features; Classification; Deep learning;
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Emotion identification based on multimodal data (e.g., audio, video, text) is one of the most demanding and important research fields, with a wide range of applications. In this context, this work conducts a rigorous exploration of model-level fusion to find the optimal multimodal model for emotion recognition using audio and video modalities. More specifically, separate novel feature extractor networks for audio and video data are proposed. An optimal multimodal emotion recognition model is then created by fusing the audio and video features at the model level. The performance of the proposed models is assessed on two benchmark multimodal datasets, namely the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and Surrey Audio-Visual Expressed Emotion (SAVEE), using various performance metrics. The proposed models achieve high predictive accuracies of 99% and 86% on the SAVEE and RAVDESS datasets, respectively. The effectiveness of the models is also verified by comparing their performance with that of existing emotion recognition models. Case studies are also conducted to explore the models' ability to capture the variability of the emotional states of speakers in publicly available real-world audio-visual media. (c) 2022 Elsevier B.V. All rights reserved.
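The model-level fusion the abstract describes, in which separate unimodal feature extractors produce embeddings that are joined before a shared classifier head, can be sketched as follows. This is a minimal NumPy illustration only; the extractor functions, embedding sizes (128-dim audio, 256-dim video), and input shapes are assumptions for the sketch, not the paper's actual networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def audio_feature_extractor(waveform):
    # Stand-in for an audio subnetwork: map a raw signal to a
    # fixed-size embedding (128-dim here, an assumed size).
    return np.tanh(waveform[:1280].reshape(128, 10).mean(axis=1))

def video_feature_extractor(frames):
    # Stand-in for a video subnetwork: pool per-frame features
    # into a fixed-size embedding (256-dim here, an assumed size).
    return np.tanh(frames.reshape(len(frames), -1)[:, :256].mean(axis=0))

def model_level_fusion(audio_emb, video_emb):
    # Model-level fusion: concatenate the two unimodal embeddings
    # into one vector that a shared classifier head would consume.
    return np.concatenate([audio_emb, video_emb])

waveform = rng.standard_normal(16000)       # ~1 s of 16 kHz audio (assumed)
frames = rng.standard_normal((30, 32, 32))  # 30 small video frames (assumed)

fused = model_level_fusion(audio_feature_extractor(waveform),
                           video_feature_extractor(frames))
print(fused.shape)  # (384,) -- joint representation fed to the classifier
```

The design point is that fusion happens on learned intermediate representations rather than on raw inputs (early fusion) or on per-modality predictions (decision-level fusion), so the downstream classifier can model cross-modal interactions.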
Pages: 14