Jointly Learning From Unimodal and Multimodal-Rated Labels in Audio-Visual Emotion Recognition

Cited by: 1
Authors
Goncalves, Lucas [1]
Chou, Huang-Cheng [2]
Salman, Ali N. [1]
Lee, Chi-Chun [2]
Busso, Carlos [1,3]
Affiliations
[1] Univ Texas Dallas, Richardson, TX 75080 USA
[2] Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 300, Taiwan
[3] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Source
IEEE OPEN JOURNAL OF SIGNAL PROCESSING | 2025, Vol. 6
Funding
US National Science Foundation;
Keywords
Emotion recognition; Training; Visualization; Annotations; Face recognition; Speech recognition; Computational modeling; Acoustics; Noise; Calibration; Multimodal learning; emotion recognition; audio-visual sentiment analysis; affective computing; emotion analysis; multi-label classification;
DOI
10.1109/OJSP.2025.3530274
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Subject Classification Codes
0808; 0809;
Abstract
Audio-visual emotion recognition (AVER) is an important research area in human-computer interaction (HCI). Traditionally, audio-visual emotion datasets and the models trained on them derive their ground truth from annotations that raters provide after watching the audio-visual stimuli. This conventional method, however, neglects the nuance of human emotion perception, which varies depending on the stimulus condition under which annotations are made, whether unimodal or multimodal. This study investigates whether AVER performance can be improved by integrating annotations collected under these different stimulus conditions, reflecting different levels of perceptual evaluation. We propose a two-stage training method that trains models with labels elicited by audio-only, face-only, and audio-visual stimuli. Our approach applies each level of annotation to the layers of the model in which the corresponding modality is present, modeling annotations at both the unimodal and multimodal levels to capture the full scope of emotion perception across unimodal and multimodal contexts. We conduct experiments and evaluate the models on the CREMA-D emotion database. The proposed method achieves the best macro- and weighted-F1 scores. Additionally, we assess the calibration, performance bias, and fairness of the AVER systems with respect to age, gender, and race.
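The layer-wise use of unimodal and multimodal labels described in the abstract can be made concrete with a short sketch. The following is a minimal illustration, not the authors' implementation: it assumes a simple two-branch PyTorch network in which the audio and visual encoders are supervised with audio-only and face-only labels in stage one, and the fusion layers are supervised with audio-visual labels in stage two. All names and values here (AVEmotionNet, the feature dimensions, a loader yielding y_audio, y_face, and y_av targets) are hypothetical.

import torch
import torch.nn as nn

# Hypothetical dimensions: 40-d acoustic features, 512-d face embeddings,
# six emotion classes annotated as multi-label targets.
AUDIO_DIM, VISUAL_DIM, HIDDEN, N_EMOTIONS = 40, 512, 256, 6

class AVEmotionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(AUDIO_DIM, HIDDEN), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(VISUAL_DIM, HIDDEN), nn.ReLU())
        # Unimodal heads: supervised with audio-only / face-only labels.
        self.audio_head = nn.Linear(HIDDEN, N_EMOTIONS)
        self.visual_head = nn.Linear(HIDDEN, N_EMOTIONS)
        # Fusion head: supervised with audio-visual labels.
        self.fusion_head = nn.Sequential(
            nn.Linear(2 * HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, N_EMOTIONS))

    def forward(self, audio, visual):
        ha, hv = self.audio_enc(audio), self.visual_enc(visual)
        return (self.audio_head(ha), self.visual_head(hv),
                self.fusion_head(torch.cat([ha, hv], dim=-1)))

def train_two_stage(model, loader, epochs_per_stage=5, lr=1e-4):
    # Multi-label emotion targets -> binary cross-entropy per class.
    bce = nn.BCEWithLogitsLoss()
    # Stage 1: fit the encoders and unimodal heads on unimodal labels.
    stage1_params = (list(model.audio_enc.parameters()) +
                     list(model.visual_enc.parameters()) +
                     list(model.audio_head.parameters()) +
                     list(model.visual_head.parameters()))
    opt = torch.optim.Adam(stage1_params, lr=lr)
    for _ in range(epochs_per_stage):
        for audio, visual, y_audio, y_face, _ in loader:
            logits_a, logits_v, _ = model(audio, visual)
            loss = bce(logits_a, y_audio) + bce(logits_v, y_face)
            opt.zero_grad(); loss.backward(); opt.step()
    # Stage 2: fit only the fusion layers on multimodal (audio-visual) labels.
    opt = torch.optim.Adam(model.fusion_head.parameters(), lr=lr)
    for _ in range(epochs_per_stage):
        for audio, visual, _, _, y_av in loader:
            _, _, logits_av = model(audio, visual)
            loss = bce(logits_av, y_av)
            opt.zero_grad(); loss.backward(); opt.step()

Freezing the unimodal encoders in the second stage is one plausible reading of the two-stage recipe; the paper's actual training schedule, losses, and architecture may differ.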
Pages: 165-174
Page count: 10