Speech Emotion Recognition Model Based on Joint Modeling of Discrete and Dimensional Emotion Representation

Cited by: 1
Authors
Bautista, John Lorenzo [1 ,2 ]
Shin, Hyun Soon [1 ,2 ]
Affiliations
[1] Elect & Telecommun Res Inst ETRI, Emot Recognit IoT Res Sect, Hyper Connected Commun Res Lab, Daejeon 34129, South Korea
[2] Korea Univ Sci & Technol, ETRI Sch Artificial Intelligence, Daejeon 34113, South Korea
Source
APPLIED SCIENCES-BASEL | 2025, Vol. 15, No. 2
Keywords
adaptive weight balancing scheme; affective computing; dimensional emotion representation; discrete emotion representation; joint model architecture; Speech Emotion Recognition (SER); FRAMEWORK
DOI
10.3390/app15020623
Abstract
This paper introduces a novel joint model architecture for Speech Emotion Recognition (SER) that integrates both discrete and dimensional emotional representations, allowing for the simultaneous training of classification and regression tasks to improve the comprehensiveness and interpretability of emotion recognition. By employing a joint loss function that combines categorical and regression losses, the model ensures balanced optimization across tasks, with experiments exploring various weighting schemes using a tunable parameter to adjust task importance. Two adaptive weight balancing schemes, Dynamic Weighting and Joint Weighting, further enhance performance by dynamically adjusting task weights based on optimization progress and ensuring balanced emotion representation during backpropagation. The architecture employs parallel feature extraction through independent encoders, designed to capture unique features from multiple modalities, including Mel-frequency Cepstral Coefficients (MFCC), Short-term Features (STF), Mel-spectrograms, and raw audio signals. Additionally, pre-trained models such as Wav2Vec 2.0 and HuBERT are integrated to leverage their robust latent features. The inclusion of self-attention and co-attention mechanisms allows the model to capture relationships between input modalities and interdependencies among features, further improving its interpretability and integration capabilities. Experiments conducted on the IEMOCAP dataset using a leave-one-subject-out approach demonstrate the model's effectiveness, with results showing a 1-2% accuracy improvement over classification-only models. The optimal configuration, incorporating the joint architecture, dynamic weighting, and parallel processing of multimodal features, achieves a weighted accuracy of 72.66%, an unweighted accuracy of 73.22%, and a mean Concordance Correlation Coefficient (CCC) of 0.3717. These results validate the effectiveness of the proposed joint model architecture and adaptive balancing weight schemes in improving SER performance.
Pages: 20
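
This record is abstract-only and the paper's own code is not included here. As a rough illustration of the joint objective the abstract describes, the following PyTorch-style sketch combines a categorical cross-entropy term with a dimensional regression term expressed as 1 - CCC, balanced by a tunable weight alpha. All names, shapes, and the exact loss form are illustrative assumptions, not the authors' implementation; in particular, the paper's Dynamic Weighting and Joint Weighting schemes adjust these weights during training rather than fixing alpha.

import torch
import torch.nn as nn

def ccc(pred, target, eps=1e-8):
    # Concordance Correlation Coefficient between two 1-D tensors.
    pred_mean, target_mean = pred.mean(), target.mean()
    pred_var = pred.var(unbiased=False)
    target_var = target.var(unbiased=False)
    cov = ((pred - pred_mean) * (target - target_mean)).mean()
    return 2 * cov / (pred_var + target_var + (pred_mean - target_mean) ** 2 + eps)

class JointLoss(nn.Module):
    # Hypothetical joint loss: alpha weights the categorical term,
    # (1 - alpha) weights the dimensional term, written as 1 - CCC so
    # that maximizing concordance minimizes the loss.
    def __init__(self, alpha=0.5):
        super().__init__()
        self.alpha = alpha
        self.ce = nn.CrossEntropyLoss()

    def forward(self, class_logits, class_labels, dim_preds, dim_labels):
        loss_cls = self.ce(class_logits, class_labels)
        # Mean (1 - CCC) over the dimensional axes
        # (e.g., valence, arousal, dominance).
        loss_reg = torch.stack([
            1 - ccc(dim_preds[:, i], dim_labels[:, i])
            for i in range(dim_preds.shape[1])
        ]).mean()
        return self.alpha * loss_cls + (1 - self.alpha) * loss_reg

# Illustrative usage with random tensors: 4 emotion classes, 3 dimensions.
logits = torch.randn(8, 4)
labels = torch.randint(0, 4, (8,))
vad_pred = torch.rand(8, 3)
vad_true = torch.rand(8, 3)
loss = JointLoss(alpha=0.7)(logits, labels, vad_pred, vad_true)

The weighted accuracy (WA) and unweighted accuracy (UA) figures quoted in the abstract are the standard SER metrics: WA is the overall fraction of utterances classified correctly, while UA is the unweighted mean of per-class recalls, which is less sensitive to class imbalance. A small sketch of the usual computation (again illustrative, not taken from the paper):

import numpy as np

def wa_ua(y_true, y_pred):
    # WA: fraction of all utterances classified correctly.
    # UA: unweighted mean of per-class recalls.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = float((y_true == y_pred).mean())
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    ua = float(np.mean(recalls))
    return wa, ua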