Jointly Learning From Unimodal and Multimodal-Rated Labels in Audio-Visual Emotion Recognition

Cited by: 1
Authors
Goncalves, Lucas [1]
Chou, Huang-Cheng [2]
Salman, Ali N. [1]
Lee, Chi-Chun [2]
Busso, Carlos [1,3]
Affiliations
[1] Univ Texas Dallas, Richardson, TX 75080 USA
[2] Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 300, Taiwan
[3] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Source
IEEE OPEN JOURNAL OF SIGNAL PROCESSING | 2025, Vol. 6
Funding
US National Science Foundation;
Keywords
Emotion recognition; Training; Visualization; Annotations; Face recognition; Speech recognition; Computational modeling; Acoustics; Noise; Calibration; Multimodal learning; emotion recognition; audio-visual sentiment analysis; affective computing; emotion analysis; multi-label classification;
DOI
10.1109/OJSP.2025.3530274
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Subject Classification Codes
0808; 0809;
Abstract
Audio-visual emotion recognition (AVER) is an important research area in human-computer interaction (HCI). Traditionally, audio-visual emotion datasets and the models trained on them derive their ground truth from annotations that raters provide after watching the audio-visual stimuli. This conventional method, however, neglects the nuance of human emotion perception, which varies depending on the stimulus condition under which annotations are made, whether unimodal or multimodal. This study investigates whether AVER performance can be improved by integrating annotations collected under these different stimulus conditions, reflecting different levels of perceptual evaluation. We propose a two-stage training method that trains models with labels elicited by audio-only, face-only, and audio-visual stimuli. Our approach applies each level of annotation to the layers of the model in which the corresponding modality is present, modeling annotations at both the unimodal and multimodal levels to capture the full scope of emotion perception across unimodal and multimodal contexts. We conduct experiments and evaluate the models on the CREMA-D emotion database. The proposed method achieves the best macro- and weighted-F1 scores. Additionally, we assess the calibration, performance bias, and fairness of the AVER systems with respect to age, gender, and race.
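The layer-wise use of unimodal and multimodal labels described in the abstract can be made concrete with a short sketch. The following is a minimal illustration, not the authors' implementation: it assumes a simple two-branch PyTorch network in which the audio and visual encoders are supervised with audio-only and face-only labels in stage one, and the fusion layers are supervised with audio-visual labels in stage two. All names and values here (AVEmotionNet, the feature dimensions, a loader yielding y_audio, y_face, and y_av targets) are hypothetical.

import torch
import torch.nn as nn

# Hypothetical dimensions: 40-d acoustic features, 512-d face embeddings,
# six emotion classes annotated as multi-label targets.
AUDIO_DIM, VISUAL_DIM, HIDDEN, N_EMOTIONS = 40, 512, 256, 6

class AVEmotionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(AUDIO_DIM, HIDDEN), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(VISUAL_DIM, HIDDEN), nn.ReLU())
        # Unimodal heads: supervised with audio-only / face-only labels.
        self.audio_head = nn.Linear(HIDDEN, N_EMOTIONS)
        self.visual_head = nn.Linear(HIDDEN, N_EMOTIONS)
        # Fusion head: supervised with audio-visual labels.
        self.fusion_head = nn.Sequential(
            nn.Linear(2 * HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, N_EMOTIONS))

    def forward(self, audio, visual):
        ha, hv = self.audio_enc(audio), self.visual_enc(visual)
        return (self.audio_head(ha), self.visual_head(hv),
                self.fusion_head(torch.cat([ha, hv], dim=-1)))

def train_two_stage(model, loader, epochs_per_stage=5, lr=1e-4):
    # Multi-label emotion targets -> binary cross-entropy per class.
    bce = nn.BCEWithLogitsLoss()
    # Stage 1: fit the encoders and unimodal heads on unimodal labels.
    stage1_params = (list(model.audio_enc.parameters()) +
                     list(model.visual_enc.parameters()) +
                     list(model.audio_head.parameters()) +
                     list(model.visual_head.parameters()))
    opt = torch.optim.Adam(stage1_params, lr=lr)
    for _ in range(epochs_per_stage):
        for audio, visual, y_audio, y_face, _ in loader:
            logits_a, logits_v, _ = model(audio, visual)
            loss = bce(logits_a, y_audio) + bce(logits_v, y_face)
            opt.zero_grad(); loss.backward(); opt.step()
    # Stage 2: fit only the fusion layers on multimodal (audio-visual) labels.
    opt = torch.optim.Adam(model.fusion_head.parameters(), lr=lr)
    for _ in range(epochs_per_stage):
        for audio, visual, _, _, y_av in loader:
            _, _, logits_av = model(audio, visual)
            loss = bce(logits_av, y_av)
            opt.zero_grad(); loss.backward(); opt.step()

Freezing the unimodal encoders in the second stage is one plausible reading of the two-stage recipe; the paper's actual training schedule, losses, and architecture may differ.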
Pages: 165-174
Page count: 10