Jointly Learning From Unimodal and Multimodal-Rated Labels in Audio-Visual Emotion Recognition

Cited by: 0
Authors
Goncalves, Lucas [1]
Chou, Huang-Cheng [2]
Salman, Ali N. [1]
Lee, Chi-Chun [2]
Busso, Carlos [1,3]
Affiliations
[1] Univ Texas Dallas, Richardson, TX 75080 USA
[2] Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 300, Taiwan
[3] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Source
IEEE OPEN JOURNAL OF SIGNAL PROCESSING | 2025, Vol. 6
Funding
National Science Foundation (USA);
Keywords
Emotion recognition; Training; Visualization; Annotations; Face recognition; Speech recognition; Computational modeling; Acoustics; Noise; Calibration; Multimodal learning; emotion recognition; audio-visual sentiment analysis; affective computing; emotion analysis; multi-label classification;
DOI
10.1109/OJSP.2025.3530274
Chinese Library Classification
TM [Electrical Technology]; TN [Electronics and Communication Technology];
Discipline Codes
0808; 0809;
Abstract
Audio-visual emotion recognition (AVER) is an important research area in human-computer interaction (HCI). Traditionally, audio-visual emotion datasets and the models built on them derive their ground truth from annotations that raters provide after watching the full audio-visual stimuli. This conventional practice, however, neglects the nuances of human emotion perception, which changes with the stimulus condition under which annotations are collected, whether unimodal or multimodal. This study investigates whether AVER performance can be improved by integrating annotations collected under these different stimulus conditions, which reflect the corresponding perceptual evaluations. We propose a two-stage training method that trains models with labels elicited by audio-only, face-only, and audio-visual stimuli. Our approach applies each level of annotation stimulus to the layers of the model in which the corresponding modality is present, modeling annotation at both the unimodal and multimodal levels and thereby capturing the full scope of emotion perception across unimodal and multimodal contexts. We conduct experiments and evaluate the models on the CREMA-D emotion database, where the proposed method achieves the best macro- and weighted-F1 scores. Additionally, we measure calibration, performance bias, and fairness of the AVER systems with respect to age, gender, and race.
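The abstract describes the two-stage scheme only at a high level; the sketch below illustrates one plausible realization in PyTorch, not the authors' implementation. The encoder and head layout, hidden sizes, the six CREMA-D emotion classes, the soft multi-label BCE targets, and the data-loader interface are all assumptions made for illustration.

    # Minimal sketch (assumed design, not the paper's code): unimodal branches are
    # supervised with labels collected under audio-only / face-only stimuli, then
    # the fusion layers are supervised with labels from audio-visual stimuli.
    import torch
    import torch.nn as nn

    NUM_EMOTIONS = 6  # CREMA-D categorical emotions (assumption)

    class AVERModel(nn.Module):
        def __init__(self, audio_dim=40, visual_dim=512, hidden=128):
            super().__init__()
            # Unimodal encoders, each with its own classification head.
            self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
            self.visual_enc = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
            self.audio_head = nn.Linear(hidden, NUM_EMOTIONS)
            self.visual_head = nn.Linear(hidden, NUM_EMOTIONS)
            # Fusion layers supervised with audio-visual labels.
            self.fusion = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
            self.av_head = nn.Linear(hidden, NUM_EMOTIONS)

        def forward(self, audio, visual):
            ha, hv = self.audio_enc(audio), self.visual_enc(visual)
            logits_a = self.audio_head(ha)    # audio-only prediction
            logits_v = self.visual_head(hv)   # face-only prediction
            logits_av = self.av_head(self.fusion(torch.cat([ha, hv], dim=-1)))
            return logits_a, logits_v, logits_av

    def train_two_stage(model, loader, epochs_stage1=5, epochs_stage2=5):
        """loader is assumed to yield (audio, visual, y_audio, y_visual, y_av),
        with soft multi-label targets in [0, 1] for each stimulus condition."""
        bce = nn.BCEWithLogitsLoss()

        # Stage 1: fit the unimodal branches to the unimodal-rated labels.
        stage1_params = (list(model.audio_enc.parameters()) + list(model.audio_head.parameters())
                         + list(model.visual_enc.parameters()) + list(model.visual_head.parameters()))
        opt = torch.optim.Adam(stage1_params, lr=1e-3)
        for _ in range(epochs_stage1):
            for audio, visual, y_a, y_v, _ in loader:
                logits_a, logits_v, _ = model(audio, visual)
                loss = bce(logits_a, y_a) + bce(logits_v, y_v)
                opt.zero_grad(); loss.backward(); opt.step()

        # Stage 2: freeze the unimodal branches and fit the fusion layers
        # to the labels elicited by audio-visual stimuli.
        for p in model.parameters():
            p.requires_grad = False
        stage2_params = list(model.fusion.parameters()) + list(model.av_head.parameters())
        for p in stage2_params:
            p.requires_grad = True
        opt = torch.optim.Adam(stage2_params, lr=1e-3)
        for _ in range(epochs_stage2):
            for audio, visual, _, _, y_av in loader:
                _, _, logits_av = model(audio, visual)
                loss = bce(logits_av, y_av)
                opt.zero_grad(); loss.backward(); opt.step()
        return model

This matches the abstract's idea of routing each annotation condition to the layers where the corresponding modality is available; the exact losses, label aggregation, and freezing schedule in the paper may differ.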
Pages: 165-174
Number of pages: 10