Jointly Learning From Unimodal and Multimodal-Rated Labels in Audio-Visual Emotion Recognition

Cited by: 1
Authors
Goncalves, Lucas [1]
Chou, Huang-Cheng [2]
Salman, Ali N. [1]
Lee, Chi-Chun [2]
Busso, Carlos [1,3]
Affiliations
[1] Univ Texas Dallas, Richardson, TX 75080 USA
[2] Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 300, Taiwan
[3] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Source
IEEE OPEN JOURNAL OF SIGNAL PROCESSING | 2025 / Vol. 6
Funding
U.S. National Science Foundation
Keywords
Emotion recognition; Training; Visualization; Annotations; Face recognition; Speech recognition; Computational modeling; Acoustics; Noise; Calibration; Multimodal learning; audio-visual sentiment analysis; affective computing; emotion analysis; multi-label classification
DOI
10.1109/OJSP.2025.3530274
CLC (Chinese Library Classification)
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline Classification Codes
0808; 0809
Abstract
Audio-visual emotion recognition (AVER) has been an important research area in human-computer interaction (HCI). Traditionally, audio-visual emotional datasets and the corresponding models derive their ground truths from annotations obtained by raters after watching the audio-visual stimuli. This conventional method, however, neglects the nuanced human perception of emotional states, which varies with the condition under which annotations are made, whether the raters are exposed to unimodal or multimodal stimuli. This study investigates the potential for improved AVER performance by integrating annotations collected under these diverse stimulus conditions, reflecting their varying perceptual evaluations. We propose a two-stage training method that trains models with labels elicited by audio-only, face-only, and audio-visual stimuli. Our approach applies each level of annotation stimulus according to which modality is present within different layers of the model, effectively modeling annotations at both the unimodal and multimodal levels to capture the full scope of emotion perception across unimodal and multimodal contexts. We conduct experiments and evaluate the models on the CREMA-D emotion database. The proposed method achieves the best performance in macro- and weighted-F1 scores. Additionally, we measure the calibration, performance bias, and fairness of the AVER systems with respect to age, gender, and race.
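The abstract gives no implementation details, but the core idea, supervising each part of the model with labels whose elicitation condition matches that part's modality coverage, can be illustrated with a short sketch. Below is a minimal, hypothetical PyTorch rendering: the module names, feature dimensions, the sigmoid/BCE loss choice, and the exact split into two stages are assumptions made for illustration, not the authors' architecture.

```python
# Illustrative sketch only; assumes a PyTorch setup. Module names, feature
# dimensions, and the staging details are assumptions, not the paper's design.
import torch
import torch.nn as nn

class AVERSketch(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=512, hidden=128, n_emotions=6):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
        # Unimodal heads: matched to audio-only / face-only stimulus labels.
        self.audio_head = nn.Linear(hidden, n_emotions)
        self.visual_head = nn.Linear(hidden, n_emotions)
        # Fusion head: matched to audio-visual stimulus labels.
        self.fusion_head = nn.Linear(2 * hidden, n_emotions)

    def forward(self, audio, visual):
        h_a = self.audio_enc(audio)
        h_v = self.visual_enc(visual)
        return (self.audio_head(h_a),
                self.visual_head(h_v),
                self.fusion_head(torch.cat([h_a, h_v], dim=-1)))

# Soft, multi-rater targets make this a multi-label problem, so a
# sigmoid/BCE objective is one plausible choice.
bce = nn.BCEWithLogitsLoss()

def stage1_loss(model, audio, visual, y_audio_only, y_face_only):
    # Stage 1 (illustrative): supervise each unimodal branch with labels
    # elicited under the matching stimulus condition.
    logit_a, logit_v, _ = model(audio, visual)
    return bce(logit_a, y_audio_only) + bce(logit_v, y_face_only)

def stage2_loss(model, audio, visual, y_audio_visual):
    # Stage 2 (illustrative): supervise the fused prediction with labels
    # elicited by the full audio-visual stimulus.
    _, _, logit_av = model(audio, visual)
    return bce(logit_av, y_audio_visual)
```

The design point the sketch tries to capture is the alignment of label provenance with modality coverage: the unimodal branches see labels from raters who only heard or only saw the stimulus, while the fused output sees labels from raters who experienced both.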
Pages: 165-174
Number of pages: 10