Jointly Learning From Unimodal and Multimodal-Rated Labels in Audio-Visual Emotion Recognition

Cited by: 0
Authors
Goncalves, Lucas [1 ]
Chou, Huang-Cheng [2 ]
Salman, Ali N. [1 ]
Lee, Chi-Chun [2 ]
Busso, Carlos [1 ,3 ]
Affiliations
[1] Univ Texas Dallas, Richardson, TX 75080 USA
[2] Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 300, Taiwan
[3] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Source
IEEE OPEN JOURNAL OF SIGNAL PROCESSING | 2025 / Vol. 6
Funding
U.S. National Science Foundation;
Keywords
Emotion recognition; Training; Visualization; Annotations; Face recognition; Speech recognition; Computational modeling; Acoustics; Noise; Calibration; Multimodal learning; emotion recognition; audio-visual sentiment analysis; affective computing; emotion analysis; multi-label classification;
DOI
10.1109/OJSP.2025.3530274
Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronic technology, communication technology];
Discipline Classification Code
0808 ; 0809 ;
Abstract
Audio-visual emotion recognition (AVER) has been an important research area in human-computer interaction (HCI). Traditionally, audio-visual emotional datasets and the corresponding models derive their ground truths from annotations obtained by raters after watching the audio-visual stimuli. This conventional method, however, neglects the nuanced nature of human emotion perception, which varies depending on whether annotations are elicited by unimodal or multimodal stimuli. This study investigates whether AVER system performance can be enhanced by integrating annotations collected under these different stimulus conditions, reflecting the varying perceptual evaluations. We propose a two-stage training method that trains models with labels elicited by audio-only, face-only, and audio-visual stimuli. Our approach applies each level of annotation stimulus according to which modality is present within different layers of the model, effectively modeling annotations at both the unimodal and multimodal levels to capture the full scope of emotion perception across unimodal and multimodal contexts. We conduct experiments and evaluate the models on the CREMA-D emotion database. The proposed methods achieve the best performance in macro- and weighted-F1 scores. Additionally, we assess the calibration, performance bias, and fairness of the AVER systems with respect to age, gender, and race.
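The abstract does not spell out the training objective, but the core idea (supervising unimodal branches with labels elicited by unimodal stimuli, and the fusion layer with labels elicited by audio-visual stimuli) can be sketched as a combined loss. The function names, branch structure, and weights below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def cross_entropy(pred, soft_label, eps=1e-12):
    """Cross-entropy between a predicted distribution and a (possibly soft) label."""
    pred = np.asarray(pred, dtype=float)
    soft_label = np.asarray(soft_label, dtype=float)
    return -float(np.sum(soft_label * np.log(pred + eps)))

def joint_unimodal_multimodal_loss(audio_pred, face_pred, fused_pred,
                                   audio_label, face_label, av_label,
                                   w_uni=0.5, w_multi=1.0):
    """Hypothetical joint objective: the audio and face branches are
    supervised with labels rated from audio-only and face-only stimuli,
    while the fused output is supervised with audio-visual-rated labels.
    The weights w_uni / w_multi are illustrative hyperparameters."""
    uni_loss = (cross_entropy(audio_pred, audio_label)
                + cross_entropy(face_pred, face_label))
    multi_loss = cross_entropy(fused_pred, av_label)
    return w_uni * uni_loss + w_multi * multi_loss
```

In this reading, the unimodal terms shape the modality-specific layers while the multimodal term governs the fusion layers, mirroring the abstract's "different levels of annotation stimuli within different layers of the model"; the paper's actual two-stage schedule and architecture are not reproduced here.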
Pages: 165-174
Page count: 10
Related Papers
50 results
  • [1] Audio-Visual Learning for Multimodal Emotion Recognition
    Fan, Siyu
    Jing, Jianan
    Wang, Chongwen
    SYMMETRY-BASEL, 2025, 17 (03):
  • [2] Audio-visual spontaneous emotion recognition
    Zeng, Zhihong
    Hu, Yuxiao
    Roisman, Glenn I.
    Wen, Zhen
    Fu, Yun
    Huang, Thomas S.
    ARTIFICIAL INTELLIGENCE FOR HUMAN COMPUTING, 2007, 4451 : 72 - +
  • [3] An Active Learning Paradigm for Online Audio-Visual Emotion Recognition
    Kansizoglou, Ioannis
    Bampis, Loukas
    Gasteratos, Antonios
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2022, 13 (02) : 756 - 768
  • [4] Metric Learning-Based Multimodal Audio-Visual Emotion Recognition
    Ghaleb, Esam
    Popa, Mirela
    Asteriadis, Stylianos
    IEEE MULTIMEDIA, 2020, 27 (01) : 37 - 48
  • [5] Multimodal Emotion Recognition using Physiological and Audio-Visual Features
    Matsuda, Yuki
    Fedotov, Dmitrii
    Takahashi, Yuta
    Arakawa, Yutaka
    Yasumo, Keiichi
    Minker, Wolfgang
    PROCEEDINGS OF THE 2018 ACM INTERNATIONAL JOINT CONFERENCE ON PERVASIVE AND UBIQUITOUS COMPUTING AND PROCEEDINGS OF THE 2018 ACM INTERNATIONAL SYMPOSIUM ON WEARABLE COMPUTERS (UBICOMP/ISWC'18 ADJUNCT), 2018, : 946 - 951
  • [6] Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition
    Zhang, Shiqing
    Zhang, Shiliang
    Huang, Tiejun
    Gao, Wen
    ICMR'16: PROCEEDINGS OF THE 2016 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2016, : 281 - 284
  • [7] DEEP MULTIMODAL LEARNING FOR AUDIO-VISUAL SPEECH RECOGNITION
    Mroueh, Youssef
    Marcheret, Etienne
    Goel, Vaibhava
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 2130 - 2134
  • [8] Deep Learning Based Audio-Visual Emotion Recognition in a Smart Learning Environment
    Ivleva, Natalja
    Pentel, Avar
    Dunajeva, Olga
    Justsenko, Valeria
    TOWARDS A HYBRID, FLEXIBLE AND SOCIALLY ENGAGED HIGHER EDUCATION, VOL 1, ICL 2023, 2024, 899 : 420 - 431
  • [9] Emotion recognition from unimodal to multimodal analysis: A review
    Ezzameli, K.
    Mahersia, H.
    INFORMATION FUSION, 2023, 99
  • [10] DISENTANGLEMENT FOR AUDIO-VISUAL EMOTION RECOGNITION USING MULTITASK SETUP
    Peri, Raghuveer
    Parthasarathy, Srinivas
    Bradshaw, Charles
    Sundaram, Shiva
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6344 - 6348