Uncertainty-Based Learning of a Lightweight Model for Multimodal Emotion Recognition

Cited by: 1
Authors
Radoi, Anamaria [1]
Cioroiu, George [1]
Affiliations
[1] NUST Politehn Bucharest, Dept Appl Elect & Informat Engn, Bucharest 060042, Romania
Keywords
Emotion recognition; Visualization; Feature extraction; Training; Computer architecture; Data mining; Transformers; Convolutional neural networks; Entropy; Uncertainty; multimodal emotion recognition; uncertainty-based learning; MTCNN; CREMA-D; RAVDESS; FACIAL EXPRESSION; NEURAL-NETWORKS; REPRESENTATIONS
DOI
10.1109/ACCESS.2024.3450674
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Emotion recognition is a key research topic in the Affective Computing domain, with implications in marketing, human-robot interaction, and health domains. The continuous technological advances in sensors and the rapid development of artificial intelligence technologies have led to breakthroughs and improved the interpretation of human emotions. In this paper, we propose a lightweight neural network architecture that extracts and analyzes multimodal information using the same audio and visual networks across multiple temporal segments. Data collection and annotation for emotion recognition tasks remain challenging in terms of the expertise and effort required. For this reason, the learning process of the proposed multimodal architecture is based on an iterative procedure that starts with a small volume of annotated samples and allows a step-by-step improvement of the system by assessing the model uncertainty in recognizing discrete emotions. Specifically, at each epoch, the learning process is guided by the most uncertainly annotated samples and integrates different modes of expressing emotions through a simple augmentation technique. The framework is tested on two publicly available multimodal datasets for emotion recognition, namely CREMA-D and RAVDESS, using 5-fold cross-validation. Compared to state-of-the-art methods, the achieved performance demonstrates the effectiveness of the proposed approach, with an overall accuracy of 74.2% on CREMA-D and 76.3% on RAVDESS. Moreover, with a small number of model parameters and a low inference time, the proposed neural network architecture is a valid candidate for integration on platforms with limited memory and computational resources.
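The abstract does not state how model uncertainty is scored. The keywords mention entropy, so the following is a minimal sketch that assumes uncertainty is the Shannon entropy of the predicted class distribution and that the highest-entropy samples are selected at each step. The function names, the sample probabilities, and the selection size are hypothetical and are not taken from the paper.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    # Zero-probability classes contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def most_uncertain(prob_rows, k):
    """Return the indices of the k highest-entropy rows, most uncertain first."""
    order = sorted(
        range(len(prob_rows)),
        key=lambda i: predictive_entropy(prob_rows[i]),
        reverse=True,
    )
    return order[:k]


# Hypothetical softmax outputs for four samples over three emotion classes.
predictions = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # near-uniform prediction, high entropy
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],
]

# Assumption: the two most uncertain samples guide the next learning step.
selected = most_uncertain(predictions, 2)
print(selected)
```

Entropy is only one possible uncertainty score; margin-based or least-confidence scores would fit the same selection loop.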
Pages: 120362-120374
Number of pages: 13