Deep CNN with late fusion for real time multimodal emotion recognition

Cited: 4
Authors:
Dixit, Chhavi [1]
Satapathy, Shashank Mouli [2]
Affiliations:
[1] Shell India Markets Pvt Ltd, Bengaluru 560103, Karnataka, India
[2] Vellore Inst Technol, Sch Comp Sci & Engn, Vellore 632014, Tamil Nadu, India
Keywords:
CNN; Cross dataset; Ensemble learning; FastText; Multimodal emotion recognition; Stacking; Sentiment analysis; Model
DOI:
10.1016/j.eswa.2023.122579
CLC Classification:
TP18 [Theory of Artificial Intelligence]
Discipline Codes:
081104; 0812; 0835; 1405
Abstract:
Emotion recognition is a fundamental aspect of human communication and plays a crucial role in various domains. This work develops an efficient model for real-time multimodal emotion recognition in videos of human oration (opinion videos), in which speakers express their opinions on various topics. Four separate datasets are used, contributing 20,000 samples for text, 1,440 for audio, 35,889 for images, and 3,879 videos for multimodal analysis. One model is trained for each modality: fastText for text analysis, chosen for its efficiency, robustness to noise, and pre-trained embeddings; a customized 1-D CNN for audio analysis, exploiting its translation invariance, hierarchical feature extraction, scalability, and generalization; and a custom 2-D CNN for image analysis, for its ability to capture local features and handle variations in image content. The models are tested and combined on the CMU-MOSEI dataset using both bagging and stacking to find the most effective architecture, and are then used for real-time analysis of speeches. Each model is trained on 80% of its dataset; the remaining 20% is used to test individual and combined accuracies on CMU-MOSEI. The emotions finally predicted by the architecture correspond to the six classes in the CMU-MOSEI dataset. This cross-dataset training and testing makes the models robust and efficient for general use, removes reliance on any single domain or dataset, and adds more data points for model training. The proposed architecture achieved an accuracy of 85.85% and an F1-score of 83 on the CMU-MOSEI dataset.
Pages: 15
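The late-fusion design the abstract describes (three unimodal models whose class probabilities are combined by a stacking ensemble) can be sketched roughly as follows. This is a minimal illustration in Python: the `unimodal_probs` stand-ins, the synthetic labels, and the logistic-regression meta-learner are assumptions for demonstration, not the authors' implementation; the six emotion classes of CMU-MOSEI are the only detail taken from the paper.

```python
# Minimal sketch of late fusion by stacking, as described in the abstract.
# All names, shapes, and the logistic-regression meta-learner are illustrative
# assumptions, not the authors' exact implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Six emotion classes, as in CMU-MOSEI.
EMOTIONS = ["happiness", "sadness", "anger", "fear", "disgust", "surprise"]
N_CLASSES = len(EMOTIONS)
rng = np.random.default_rng(0)

def unimodal_probs(y: np.ndarray, strength: float) -> np.ndarray:
    """Stand-in for one trained unimodal model (e.g. fastText, 1-D CNN,
    2-D CNN): noisy softmax probabilities that weakly favour the true label."""
    logits = rng.normal(size=(len(y), N_CLASSES)) + strength * np.eye(N_CLASSES)[y]
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

y_train = rng.integers(0, N_CLASSES, size=800)
y_test = rng.integers(0, N_CLASSES, size=200)

# Late fusion: concatenate the per-class probability vectors produced by the
# text, audio, and image models into one feature vector per sample ...
X_train = np.hstack([unimodal_probs(y_train, s) for s in (1.0, 0.8, 1.2)])
X_test = np.hstack([unimodal_probs(y_test, s) for s in (1.0, 0.8, 1.2)])

# ... then fit a meta-classifier on the concatenated outputs (stacking).
meta = LogisticRegression(max_iter=1000)
meta.fit(X_train, y_train)
print("stacked accuracy:", meta.score(X_test, y_test))
```

By contrast, a simple averaging/voting combination of the three probability vectors would require no trained meta-model; per the abstract, both styles of combination were compared to select the most effective architecture.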