Data Augmentation for Audio-Visual Emotion Recognition with an Efficient Multimodal Conditional GAN

Citations: 37
Authors
Ma, Fei [1 ]
Li, Yang [1 ]
Ni, Shiguang [2 ]
Huang, Shao-Lun [1 ]
Zhang, Lin [1 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Berkeley Shenzhen Inst, Shenzhen 518055, Peoples R China
[2] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen 518055, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2022 / Vol. 12 / Issue 01
Funding
National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
audio-visual emotion recognition; data augmentation; multimodal conditional generative adversarial network (GAN); Hirschfeld-Gebelein-Renyi (HGR) maximal correlation; CLASSIFICATION; CONNECTION; MULTIMEDIA; NETWORKS; FEATURES; SPEECH; FUSION;
DOI
10.3390/app12010527
Abstract
Audio-visual emotion recognition is the task of identifying human emotional states by jointly exploiting the audio and visual modalities, and it plays an important role in intelligent human-machine interaction. With the help of deep learning, previous works have made great progress on audio-visual emotion recognition. However, these deep learning methods often require a large amount of training data. In reality, data acquisition is difficult and expensive, especially for multimodal data spanning several modalities. As a result, the training data may fall into a low-data regime, in which deep learning cannot be applied effectively. In addition, class imbalance may occur in the emotional data, which can further degrade the performance of audio-visual emotion recognition. To address these problems, we propose an efficient data augmentation framework built on a multimodal conditional generative adversarial network (GAN) for audio-visual emotion recognition. Specifically, we design separate generators and discriminators for the audio and visual modalities. The category information is used as their shared input, so that our GAN can generate fake data of each category. In addition, the strong dependence between the audio and visual modalities in the generated multimodal data is modeled based on the Hirschfeld-Gebelein-Renyi (HGR) maximal correlation. In this way, we relate the different modalities in the generated data so that they approximate the real data. The generated data are then used to augment the data manifold, and we further apply our approach to the problem of class imbalance. To the best of our knowledge, this is the first work to propose a data augmentation strategy with a multimodal conditional GAN for audio-visual emotion recognition. We conduct a series of experiments on three public multimodal datasets: eNTERFACE'05, RAVDESS, and CMEW. The results indicate that our multimodal conditional GAN is highly effective for data augmentation in audio-visual emotion recognition.
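For intuition, the HGR maximal correlation between two random variables X and Y is rho(X; Y) = sup E[f(X) g(Y)], taken over real-valued functions f and g with E[f(X)] = E[g(Y)] = 0 and unit variance. The sketch below is not the authors' released code; it shows one plausible way to wire such a two-modality conditional GAN in PyTorch, where per-modality generators and discriminators share the class label and a Soft-HGR-style term (a common trainable surrogate for the HGR maximal correlation) ties the generated audio and visual features together. All feature dimensions, network widths, the projection heads f_proj/g_proj, and the weight lambda_hgr are illustrative assumptions.

```python
# Minimal sketch (assumed architecture, not the paper's released code) of a
# two-modality conditional GAN whose generated audio/visual features are
# coupled through a Soft-HGR surrogate for the HGR maximal correlation.
import torch
import torch.nn as nn

NUM_CLASSES, NOISE_DIM = 6, 100                   # e.g., six emotion categories
AUDIO_DIM, VISUAL_DIM, HGR_DIM = 128, 256, 32     # assumed feature sizes

class Generator(nn.Module):
    """Maps (noise, emotion label) to a fake feature for one modality."""
    def __init__(self, out_dim):
        super().__init__()
        self.embed = nn.Embedding(NUM_CLASSES, NOISE_DIM)
        self.net = nn.Sequential(
            nn.Linear(2 * NOISE_DIM, 512), nn.ReLU(),
            nn.Linear(512, out_dim), nn.Tanh())

    def forward(self, z, y):
        return self.net(torch.cat([z, self.embed(y)], dim=1))

class Discriminator(nn.Module):
    """Scores (feature, emotion label) pairs as real vs. fake."""
    def __init__(self, in_dim):
        super().__init__()
        self.embed = nn.Embedding(NUM_CLASSES, in_dim)
        self.net = nn.Sequential(
            nn.Linear(2 * in_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1))

    def forward(self, x, y):
        return self.net(torch.cat([x, self.embed(y)], dim=1))

def soft_hgr(fx, gy):
    """Soft-HGR surrogate: E[f(X)^T g(Y)] - 1/2 tr(cov(f) cov(g)).
    Maximizing it pushes the projected modalities toward the HGR
    maximal correlation without explicit whitening."""
    fx = fx - fx.mean(dim=0, keepdim=True)        # zero-mean constraint
    gy = gy - gy.mean(dim=0, keepdim=True)
    n = fx.size(0)
    inner = (fx * gy).sum() / (n - 1)
    cov_f = fx.t() @ fx / (n - 1)
    cov_g = gy.t() @ gy / (n - 1)
    return inner - 0.5 * (cov_f * cov_g).sum()    # tr(AB) for symmetric A, B

g_a, g_v = Generator(AUDIO_DIM), Generator(VISUAL_DIM)
d_a, d_v = Discriminator(AUDIO_DIM), Discriminator(VISUAL_DIM)
f_proj = nn.Linear(AUDIO_DIM, HGR_DIM)            # f(.) for the audio modality
g_proj = nn.Linear(VISUAL_DIM, HGR_DIM)           # g(.) for the visual modality
bce = nn.BCEWithLogitsLoss()
lambda_hgr = 0.1                                  # assumed loss weight

def generator_step(batch_size):
    """One generator update: fool both discriminators while maximizing
    the cross-modal correlation of the generated pair."""
    y = torch.randint(NUM_CLASSES, (batch_size,))         # shared labels
    fake_a = g_a(torch.randn(batch_size, NOISE_DIM), y)
    fake_v = g_v(torch.randn(batch_size, NOISE_DIM), y)
    adv = bce(d_a(fake_a, y), torch.ones(batch_size, 1)) \
        + bce(d_v(fake_v, y), torch.ones(batch_size, 1))
    return adv - lambda_hgr * soft_hgr(f_proj(fake_a), g_proj(fake_v))

loss = generator_step(16)   # in practice, alternate with discriminator updates
loss.backward()
```

Because both generators condition on the same label and the correlation term rewards dependent outputs, the sampled audio-visual pairs stay emotionally consistent, which is what makes them usable as augmented training examples.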
Pages: 24
Related Papers (50 total)
  • [1] Multimodal Emotion Recognition using Physiological and Audio-Visual Features
    Matsuda, Yuki
    Fedotov, Dmitrii
    Takahashi, Yuta
    Arakawa, Yutaka
    Yasumoto, Keiichi
    Minker, Wolfgang
    PROCEEDINGS OF THE 2018 ACM INTERNATIONAL JOINT CONFERENCE ON PERVASIVE AND UBIQUITOUS COMPUTING AND PROCEEDINGS OF THE 2018 ACM INTERNATIONAL SYMPOSIUM ON WEARABLE COMPUTERS (UBICOMP/ISWC'18 ADJUNCT), 2018, : 946 - 951
  • [2] EEG data augmentation for emotion recognition with a multiple generator conditional Wasserstein GAN
    Zhang, Aiming
    Su, Lei
    Zhang, Yin
    Fu, Yunfa
    Wu, Liping
    Liang, Shengjin
    COMPLEX & INTELLIGENT SYSTEMS, 2022, 8 (04) : 3059 - 3071
  • [3] An Active Learning Paradigm for Online Audio-Visual Emotion Recognition
    Kansizoglou, Ioannis
    Bampis, Loukas
    Gasteratos, Antonios
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2022, 13 (02) : 756 - 768
  • [4] Multimodal Learning Using 3D Audio-Visual Data for Audio-Visual Speech Recognition
    Su, Rongfeng
    Wang, Lan
    Liu, Xunying
    2017 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2017, : 40 - 43
  • [5] Learning Better Representations for Audio-Visual Emotion Recognition with Common Information
    Ma, Fei
    Zhang, Wei
    Li, Yang
    Huang, Shao-Lun
    Zhang, Lin
    APPLIED SCIENCES-BASEL, 2020, 10 (20): 1 - 23
  • [6] A GAN-Based Data Augmentation Method for Multimodal Emotion Recognition
    Luo, Yun
    Zhu, Li-Zhen
    Lu, Bao-Liang
    ADVANCES IN NEURAL NETWORKS - ISNN 2019, PT I, 2019, 11554 : 141 - 150
  • [7] ISLA: Temporal Segmentation and Labeling for Audio-Visual Emotion Recognition
    Kim, Yelin
    Provost, Emily Mower
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2019, 10 (02) : 196 - 208
  • [8] Audio-Visual Emotion Recognition with Boosted Coupled HMM
    Lu, Kun
    Jia, Yunde
    2012 21ST INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR 2012), 2012, : 1148 - 1151
  • [9] Exploring Sources of Variation in Human Behavioral Data: Towards Automatic Audio-Visual Emotion Recognition
    Kim, Yelin
    2015 INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII), 2015, : 748 - 753