Jointly Learning From Unimodal and Multimodal-Rated Labels in Audio-Visual Emotion Recognition

Cited by: 0
Authors
Goncalves, Lucas [1 ]
Chou, Huang-Cheng [2 ]
Salman, Ali N. [1 ]
Lee, Chi-Chun [2 ]
Busso, Carlos [1 ,3 ]
Affiliations
[1] Univ Texas Dallas, Richardson, TX 75080 USA
[2] Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 300, Taiwan
[3] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Source
IEEE OPEN JOURNAL OF SIGNAL PROCESSING | 2025 / Vol. 6
Funding
U.S. National Science Foundation;
Keywords
Emotion recognition; Training; Visualization; Annotations; Face recognition; Speech recognition; Computational modeling; Acoustics; Noise; Calibration; Multimodal learning; emotion recognition; audio-visual sentiment analysis; affective computing; emotion analysis; multi-label classification;
DOI
10.1109/OJSP.2025.3530274
Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronic technology, communication technology];
Discipline Classification Code
0808 ; 0809 ;
Abstract
Audio-visual emotion recognition (AVER) has been an important research area in human-computer interaction (HCI). Traditionally, audio-visual emotional datasets and the corresponding models derive their ground truths from annotations obtained by raters after watching the audio-visual stimuli. This conventional method, however, neglects the nuanced nature of human emotion perception, which varies depending on whether annotations are elicited by unimodal or multimodal stimuli. This study investigates whether AVER system performance can be enhanced by integrating annotations collected under these different stimulus conditions, reflecting the varying perceptual evaluations. We propose a two-stage training method that trains models with labels elicited by audio-only, face-only, and audio-visual stimuli. Our approach applies each level of annotation stimulus according to which modality is present within different layers of the model, effectively modeling annotations at both the unimodal and multimodal levels to capture the full scope of emotion perception across unimodal and multimodal contexts. We conduct experiments and evaluate the models on the CREMA-D emotion database. The proposed methods achieve the best performance in macro- and weighted-F1 scores. Additionally, we assess the calibration, performance bias, and fairness of the AVER systems with respect to age, gender, and race.
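The abstract does not spell out the training objective, but the core idea (supervising unimodal branches with labels elicited by unimodal stimuli, and the fusion layer with labels elicited by audio-visual stimuli) can be sketched as a combined loss. The function names, branch structure, and weights below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def cross_entropy(pred, soft_label, eps=1e-12):
    """Cross-entropy between a predicted distribution and a (possibly soft) label."""
    pred = np.asarray(pred, dtype=float)
    soft_label = np.asarray(soft_label, dtype=float)
    return -float(np.sum(soft_label * np.log(pred + eps)))

def joint_unimodal_multimodal_loss(audio_pred, face_pred, fused_pred,
                                   audio_label, face_label, av_label,
                                   w_uni=0.5, w_multi=1.0):
    """Hypothetical joint objective: the audio and face branches are
    supervised with labels rated from audio-only and face-only stimuli,
    while the fused output is supervised with audio-visual-rated labels.
    The weights w_uni / w_multi are illustrative hyperparameters."""
    uni_loss = (cross_entropy(audio_pred, audio_label)
                + cross_entropy(face_pred, face_label))
    multi_loss = cross_entropy(fused_pred, av_label)
    return w_uni * uni_loss + w_multi * multi_loss
```

In this reading, the unimodal terms shape the modality-specific layers while the multimodal term governs the fusion layers, mirroring the abstract's "different levels of annotation stimuli within different layers of the model"; the paper's actual two-stage schedule and architecture are not reproduced here.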
Pages: 165-174
Page count: 10
Related Papers
50 results
  • [1] Audio-Visual Learning for Multimodal Emotion Recognition
    Fan, Siyu
    Jing, Jianan
    Wang, Chongwen
    SYMMETRY-BASEL, 2025, 17 (03):
  • [2] Audio-visual spontaneous emotion recognition
    Zeng, Zhihong
    Hu, Yuxiao
    Roisman, Glenn I.
    Wen, Zhen
    Fu, Yun
    Huang, Thomas S.
    ARTIFICIAL INTELLIGENCE FOR HUMAN COMPUTING, 2007, 4451 : 72 - +
  • [3] An Active Learning Paradigm for Online Audio-Visual Emotion Recognition
    Kansizoglou, Ioannis
    Bampis, Loukas
    Gasteratos, Antonios
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2022, 13 (02) : 756 - 768
  • [4] Metric Learning-Based Multimodal Audio-Visual Emotion Recognition
    Ghaleb, Esam
    Popa, Mirela
    Asteriadis, Stylianos
    IEEE MULTIMEDIA, 2020, 27 (01) : 37 - 48
  • [5] Multimodal Emotion Recognition using Physiological and Audio-Visual Features
    Matsuda, Yuki
    Fedotov, Dmitrii
    Takahashi, Yuta
    Arakawa, Yutaka
    Yasumo, Keiichi
    Minker, Wolfgang
    PROCEEDINGS OF THE 2018 ACM INTERNATIONAL JOINT CONFERENCE ON PERVASIVE AND UBIQUITOUS COMPUTING AND PROCEEDINGS OF THE 2018 ACM INTERNATIONAL SYMPOSIUM ON WEARABLE COMPUTERS (UBICOMP/ISWC'18 ADJUNCT), 2018, : 946 - 951
  • [6] Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition
    Zhang, Shiqing
    Zhang, Shiliang
    Huang, Tiejun
    Gao, Wen
    ICMR'16: PROCEEDINGS OF THE 2016 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2016, : 281 - 284
  • [7] DEEP MULTIMODAL LEARNING FOR AUDIO-VISUAL SPEECH RECOGNITION
    Mroueh, Youssef
    Marcheret, Etienne
    Goel, Vaibhava
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 2130 - 2134
  • [8] Deep Learning Based Audio-Visual Emotion Recognition in a Smart Learning Environment
    Ivleva, Natalja
    Pentel, Avar
    Dunajeva, Olga
    Justsenko, Valeria
    TOWARDS A HYBRID, FLEXIBLE AND SOCIALLY ENGAGED HIGHER EDUCATION, VOL 1, ICL 2023, 2024, 899 : 420 - 431
  • [9] Emotion recognition from unimodal to multimodal analysis: A review
    Ezzameli, K.
    Mahersia, H.
    INFORMATION FUSION, 2023, 99
  • [10] DISENTANGLEMENT FOR AUDIO-VISUAL EMOTION RECOGNITION USING MULTITASK SETUP
    Peri, Raghuveer
    Parthasarathy, Srinivas
    Bradshaw, Charles
    Sundaram, Shiva
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6344 - 6348