Jointly Learning From Unimodal and Multimodal-Rated Labels in Audio-Visual Emotion Recognition

Cited by: 1
Authors
Goncalves, Lucas [1]
Chou, Huang-Cheng [2]
Salman, Ali N. [1]
Lee, Chi-Chun [2]
Busso, Carlos [1,3]
Affiliations
[1] Univ Texas Dallas, Richardson, TX 75080 USA
[2] Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 300, Taiwan
[3] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Source
IEEE OPEN JOURNAL OF SIGNAL PROCESSING | 2025 / Vol. 6
Funding
U.S. National Science Foundation
Keywords
Emotion recognition; Training; Visualization; Annotations; Face recognition; Speech recognition; Computational modeling; Acoustics; Noise; Calibration; Multimodal learning; audio-visual sentiment analysis; affective computing; emotion analysis; multi-label classification
DOI
10.1109/OJSP.2025.3530274
CLC (Chinese Library Classification)
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline Classification Codes
0808; 0809
Abstract
Audio-visual emotion recognition (AVER) has been an important research area in human-computer interaction (HCI). Traditionally, audio-visual emotional datasets and the corresponding models derive their ground truths from annotations obtained by raters after watching the audio-visual stimuli. This conventional method, however, neglects the nuanced human perception of emotional states, which varies with the condition under which annotations are made, whether the raters are exposed to unimodal or multimodal stimuli. This study investigates the potential for improved AVER performance by integrating annotations collected under these diverse stimulus conditions, reflecting their varying perceptual evaluations. We propose a two-stage training method that trains models with labels elicited by audio-only, face-only, and audio-visual stimuli. Our approach applies each level of annotation stimulus according to which modality is present within different layers of the model, effectively modeling annotations at both the unimodal and multimodal levels to capture the full scope of emotion perception across unimodal and multimodal contexts. We conduct experiments and evaluate the models on the CREMA-D emotion database. The proposed method achieves the best performance in macro- and weighted-F1 scores. Additionally, we measure the calibration, performance bias, and fairness of the AVER systems with respect to age, gender, and race.
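The abstract gives no implementation details, but the core idea, supervising each part of the model with labels whose elicitation condition matches that part's modality coverage, can be illustrated with a short sketch. Below is a minimal, hypothetical PyTorch rendering: the module names, feature dimensions, the sigmoid/BCE loss choice, and the exact split into two stages are assumptions made for illustration, not the authors' architecture.

```python
# Illustrative sketch only; assumes a PyTorch setup. Module names, feature
# dimensions, and the staging details are assumptions, not the paper's design.
import torch
import torch.nn as nn

class AVERSketch(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=512, hidden=128, n_emotions=6):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
        # Unimodal heads: matched to audio-only / face-only stimulus labels.
        self.audio_head = nn.Linear(hidden, n_emotions)
        self.visual_head = nn.Linear(hidden, n_emotions)
        # Fusion head: matched to audio-visual stimulus labels.
        self.fusion_head = nn.Linear(2 * hidden, n_emotions)

    def forward(self, audio, visual):
        h_a = self.audio_enc(audio)
        h_v = self.visual_enc(visual)
        return (self.audio_head(h_a),
                self.visual_head(h_v),
                self.fusion_head(torch.cat([h_a, h_v], dim=-1)))

# Soft, multi-rater targets make this a multi-label problem, so a
# sigmoid/BCE objective is one plausible choice.
bce = nn.BCEWithLogitsLoss()

def stage1_loss(model, audio, visual, y_audio_only, y_face_only):
    # Stage 1 (illustrative): supervise each unimodal branch with labels
    # elicited under the matching stimulus condition.
    logit_a, logit_v, _ = model(audio, visual)
    return bce(logit_a, y_audio_only) + bce(logit_v, y_face_only)

def stage2_loss(model, audio, visual, y_audio_visual):
    # Stage 2 (illustrative): supervise the fused prediction with labels
    # elicited by the full audio-visual stimulus.
    _, _, logit_av = model(audio, visual)
    return bce(logit_av, y_audio_visual)
```

The design point the sketch tries to capture is the alignment of label provenance with modality coverage: the unimodal branches see labels from raters who only heard or only saw the stimulus, while the fused output sees labels from raters who experienced both.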
Pages: 165-174
Number of pages: 10