Data Augmentation for Audio-Visual Emotion Recognition with an Efficient Multimodal Conditional GAN

Cited by: 37
Authors
Ma, Fei [1 ]
Li, Yang [1 ]
Ni, Shiguang [2 ]
Huang, Shao-Lun [1 ]
Zhang, Lin [1 ]
Affiliations
[1] Tsinghua University, Tsinghua-Berkeley Shenzhen Institute, Shenzhen 518055, People's Republic of China
[2] Tsinghua University, Tsinghua Shenzhen International Graduate School, Shenzhen 518055, People's Republic of China
Source
APPLIED SCIENCES-BASEL | 2022, Vol. 12, Issue 1
Funding
National Key R&D Program of China; National Natural Science Foundation of China
Keywords
audio-visual emotion recognition; data augmentation; multimodal conditional generative adversarial network (GAN); Hirschfeld-Gebelein-Rényi (HGR) maximal correlation; CLASSIFICATION; CONNECTION; MULTIMEDIA; NETWORKS; FEATURES; SPEECH; FUSION
DOI
10.3390/app12010527
Chinese Library Classification (CLC)
O6 [Chemistry]
Subject Classification Code
0703
Abstract
Audio-visual emotion recognition is the task of identifying human emotional states by jointly exploiting the audio and visual modalities, and it plays an important role in intelligent human-machine interaction. With the help of deep learning, prior work has made great progress on audio-visual emotion recognition. However, these deep learning methods typically require a large amount of training data. In practice, data acquisition is difficult and expensive, especially for multimodal data, so training often takes place in a low-data regime where deep models cannot be learned effectively. In addition, emotional data frequently suffer from class imbalance, which further degrades recognition performance. To address these problems, we propose an efficient data augmentation framework built on a multimodal conditional generative adversarial network (GAN) for audio-visual emotion recognition. Specifically, we design a generator and a discriminator for each of the audio and visual modalities, and feed the emotion category to all of them as a shared conditioning input so that the GAN can generate synthetic data for every category. Moreover, the strong dependence between the audio and visual modalities of the generated data is modeled with the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation; in this way, the generated modalities are coupled so that they approximate the cross-modal structure of real data. The generated data are then used to augment the training set, and we further apply the approach to the problem of class imbalance. To the best of our knowledge, this is the first work to propose a data augmentation strategy based on a multimodal conditional GAN for audio-visual emotion recognition. Experiments on three public multimodal datasets, eNTERFACE'05, RAVDESS, and CMEW, show that our multimodal conditional GAN is highly effective for data augmentation in audio-visual emotion recognition.
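To make the architecture described in the abstract concrete, below is a minimal PyTorch sketch of the core idea: one conditional generator-discriminator pair per modality, the emotion label as a shared conditioning input, and a soft surrogate of the HGR maximal correlation that couples the two generated modalities. This is an illustration under assumptions, not the authors' implementation: the layer sizes, the MLP backbones, the names (CondGenerator, soft_hgr, etc.), and the particular soft-HGR trace formulation are choices made here for brevity; the paper may use a different HGR estimator and deeper, modality-specific networks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES, NOISE_DIM = 6, 100                  # e.g., six emotion categories
AUDIO_DIM, VISUAL_DIM, FEAT_DIM = 128, 256, 64   # illustrative feature sizes


class CondGenerator(nn.Module):
    """Maps (noise, emotion label) to a synthetic feature vector for one modality."""

    def __init__(self, out_dim):
        super().__init__()
        self.embed = nn.Embedding(NUM_CLASSES, NOISE_DIM)
        self.net = nn.Sequential(
            nn.Linear(2 * NOISE_DIM, 256), nn.ReLU(),
            nn.Linear(256, out_dim), nn.Tanh())

    def forward(self, z, y):
        # The label embedding is concatenated with the noise, so the same
        # label batch can condition both the audio and the visual generator.
        return self.net(torch.cat([z, self.embed(y)], dim=1))


class CondDiscriminator(nn.Module):
    """Scores one modality as real/fake, conditioned on the emotion label."""

    def __init__(self, in_dim):
        super().__init__()
        self.embed = nn.Embedding(NUM_CLASSES, in_dim)
        self.net = nn.Sequential(
            nn.Linear(2 * in_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1))

    def forward(self, x, y):
        return self.net(torch.cat([x, self.embed(y)], dim=1))


def soft_hgr(f, g):
    """Soft surrogate for the HGR maximal correlation between two feature
    batches: E[f(X)^T g(Y)] - 0.5 * tr(cov(f) cov(g)). Larger values mean
    stronger cross-modal dependence, so the generators *maximize* it."""
    f = f - f.mean(dim=0, keepdim=True)   # enforce the zero-mean constraint
    g = g - g.mean(dim=0, keepdim=True)
    n = f.shape[0]
    inner = (f * g).sum() / n             # empirical E[f(X)^T g(Y)]
    cov_f = f.T @ f / (n - 1)
    cov_g = g.T @ g / (n - 1)
    return inner - 0.5 * torch.trace(cov_f @ cov_g)


# Per-modality networks; f_net/g_net realize the HGR feature functions.
G_a, G_v = CondGenerator(AUDIO_DIM), CondGenerator(VISUAL_DIM)
D_a, D_v = CondDiscriminator(AUDIO_DIM), CondDiscriminator(VISUAL_DIM)
f_net, g_net = nn.Linear(AUDIO_DIM, FEAT_DIM), nn.Linear(VISUAL_DIM, FEAT_DIM)


def generator_loss(batch_size, hgr_weight=1.0):
    """Adversarial loss for both generators plus the HGR coupling term."""
    y = torch.randint(0, NUM_CLASSES, (batch_size,))   # shared labels
    x_a = G_a(torch.randn(batch_size, NOISE_DIM), y)   # fake audio features
    x_v = G_v(torch.randn(batch_size, NOISE_DIM), y)   # fake visual features
    real = torch.ones(batch_size, 1)
    adv = (F.binary_cross_entropy_with_logits(D_a(x_a, y), real)
           + F.binary_cross_entropy_with_logits(D_v(x_v, y), real))
    # Subtracting the correlation rewards audio-visual pairs whose
    # statistical dependence resembles that of real multimodal data.
    return adv - hgr_weight * soft_hgr(f_net(x_a), g_net(x_v))
```

Because both generators consume the same label batch y, every synthetic audio-visual pair is label-consistent by construction; drawing y from an oversampled distribution of minority emotions is one natural way to use the same machinery against class imbalance.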
Pages: 24