Deep CNN with late fusion for real time multimodal emotion recognition

Cited: 4
Authors:
Dixit, Chhavi [1]
Satapathy, Shashank Mouli [2]
Affiliations:
[1] Shell India Markets Pvt Ltd, Bengaluru 560103, Karnataka, India
[2] Vellore Inst Technol, Sch Comp Sci & Engn, Vellore 632014, Tamil Nadu, India
Keywords:
CNN; Cross dataset; Ensemble learning; FastText; Multimodal emotion recognition; Stacking; Sentiment analysis; Model
DOI:
10.1016/j.eswa.2023.122579
CLC Classification:
TP18 [Theory of Artificial Intelligence]
Discipline Codes:
081104; 0812; 0835; 1405
Abstract:
Emotion recognition is a fundamental aspect of human communication and plays a crucial role in various domains. This work develops an efficient model for real-time multimodal emotion recognition in videos of human oration (opinion videos), in which speakers express their opinions on various topics. Four separate datasets are used, contributing 20,000 samples for text, 1,440 for audio, 35,889 for images, and 3,879 videos for multimodal analysis. One model is trained for each modality: fastText for text analysis, chosen for its efficiency, robustness to noise, and pre-trained embeddings; a customized 1-D CNN for audio analysis, exploiting its translation invariance, hierarchical feature extraction, scalability, and generalization; and a custom 2-D CNN for image analysis, for its ability to capture local features and handle variations in image content. The models are tested and combined on the CMU-MOSEI dataset using both bagging and stacking to find the most effective architecture, and are then used for real-time analysis of speeches. Each model is trained on 80% of its dataset; the remaining 20% is used to test individual and combined accuracies on CMU-MOSEI. The emotions finally predicted by the architecture correspond to the six classes in the CMU-MOSEI dataset. This cross-dataset training and testing makes the models robust and efficient for general use, removes reliance on any single domain or dataset, and adds more data points for model training. The proposed architecture achieved an accuracy of 85.85% and an F1-score of 83 on the CMU-MOSEI dataset.
Pages: 15
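The late-fusion design the abstract describes (three unimodal models whose class probabilities are combined by a stacking ensemble) can be sketched roughly as follows. This is a minimal illustration in Python: the `unimodal_probs` stand-ins, the synthetic labels, and the logistic-regression meta-learner are assumptions for demonstration, not the authors' implementation; the six emotion classes of CMU-MOSEI are the only detail taken from the paper.

```python
# Minimal sketch of late fusion by stacking, as described in the abstract.
# All names, shapes, and the logistic-regression meta-learner are illustrative
# assumptions, not the authors' exact implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Six emotion classes, as in CMU-MOSEI.
EMOTIONS = ["happiness", "sadness", "anger", "fear", "disgust", "surprise"]
N_CLASSES = len(EMOTIONS)
rng = np.random.default_rng(0)

def unimodal_probs(y: np.ndarray, strength: float) -> np.ndarray:
    """Stand-in for one trained unimodal model (e.g. fastText, 1-D CNN,
    2-D CNN): noisy softmax probabilities that weakly favour the true label."""
    logits = rng.normal(size=(len(y), N_CLASSES)) + strength * np.eye(N_CLASSES)[y]
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

y_train = rng.integers(0, N_CLASSES, size=800)
y_test = rng.integers(0, N_CLASSES, size=200)

# Late fusion: concatenate the per-class probability vectors produced by the
# text, audio, and image models into one feature vector per sample ...
X_train = np.hstack([unimodal_probs(y_train, s) for s in (1.0, 0.8, 1.2)])
X_test = np.hstack([unimodal_probs(y_test, s) for s in (1.0, 0.8, 1.2)])

# ... then fit a meta-classifier on the concatenated outputs (stacking).
meta = LogisticRegression(max_iter=1000)
meta.fit(X_train, y_train)
print("stacked accuracy:", meta.score(X_test, y_test))
```

By contrast, a simple averaging/voting combination of the three probability vectors would require no trained meta-model; per the abstract, both styles of combination were compared to select the most effective architecture.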