Data Augmentation for Audio-Visual Emotion Recognition with an Efficient Multimodal Conditional GAN

Citations: 37
|
Authors
Ma, Fei [1 ]
Li, Yang [1 ]
Ni, Shiguang [2 ]
Huang, Shao-Lun [1 ]
Zhang, Lin [1 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Berkeley Shenzhen Inst, Shenzhen 518055, Peoples R China
[2] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen 518055, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2022, Vol. 12, Issue 1
Funding
National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
audio-visual emotion recognition; data augmentation; multimodal conditional generative adversarial network (GAN); Hirschfeld-Gebelein-Renyi (HGR) maximal correlation; CLASSIFICATION; CONNECTION; MULTIMEDIA; NETWORKS; FEATURES; SPEECH; FUSION;
DOI
10.3390/app12010527
Chinese Library Classification
O6 [Chemistry];
Discipline Code
0703;
Abstract
Audio-visual emotion recognition is the task of identifying human emotional states by combining the audio and visual modalities, and it plays an important role in intelligent human-machine interaction. With the help of deep learning, previous works have made great progress in audio-visual emotion recognition. However, these deep learning methods often require large amounts of training data. In reality, data acquisition is difficult and expensive, especially for multimodal data spanning different modalities. As a result, the training data may fall in the low-data regime, where deep learning cannot be used effectively. In addition, class imbalance may occur in emotional data, further degrading the performance of audio-visual emotion recognition. To address these problems, we propose an efficient data augmentation framework built on a multimodal conditional generative adversarial network (GAN) for audio-visual emotion recognition. Specifically, we design generators and discriminators for the audio and visual modalities. The category information is used as their shared input so that our GAN can generate fake data of different categories. In addition, the strong dependence between the audio and visual modalities in the generated multimodal data is modeled via the Hirschfeld-Gebelein-Renyi (HGR) maximal correlation. In this way, we relate the different modalities of the generated data so that they approximate the real data. The generated data are then used to augment our data manifold. We further apply our approach to the problem of class imbalance. To the best of our knowledge, this is the first work to propose a data augmentation strategy with a multimodal conditional GAN for audio-visual emotion recognition. We conduct a series of experiments on three public multimodal datasets: eNTERFACE'05, RAVDESS, and CMEW. The results indicate that our multimodal conditional GAN is highly effective for data augmentation in audio-visual emotion recognition.
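The abstract's key technical ingredient is the HGR maximal correlation used to couple the generated audio and visual features. As a rough illustration (not the paper's exact estimator), the soft-HGR surrogate commonly used in deep learning can be computed from two batches of zero-mean features; the function name and the use of NumPy here are illustrative assumptions:

```python
import numpy as np

def soft_hgr_correlation(f, g):
    """Soft-HGR surrogate for the Hirschfeld-Gebelein-Renyi maximal
    correlation between two batches of paired features f, g of shape
    [n, d]. Larger values indicate stronger cross-modal dependence;
    a GAN can maximize this to tie its audio and visual outputs together.
    (Illustrative sketch; the paper's exact objective may differ.)"""
    # HGR requires zero-mean feature functions, so center each dimension.
    f = f - f.mean(axis=0, keepdims=True)
    g = g - g.mean(axis=0, keepdims=True)
    n = f.shape[0]
    # Expected inner product between paired audio/visual features.
    inner = np.sum(f * g) / n
    # Covariance penalty that discourages trivially inflating feature norms.
    cov_f = f.T @ f / (n - 1)
    cov_g = g.T @ g / (n - 1)
    return inner - 0.5 * np.trace(cov_f @ cov_g)
```

In a multimodal conditional GAN, the negative of such a term can be added to the generators' loss so that, for each class label, the generated audio and visual samples remain statistically dependent, as real paired recordings are.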
Pages: 24
Related Papers (50 total)
  • [31] EEG Data Augmentation for Emotion Recognition with a Task-Driven GAN
    Liu, Qing
    Hao, Jianjun
    Guo, Yijun
    ALGORITHMS, 2023, 16 (02)
  • [32] Applying Segment-Level Attention on Bi-Modal Transformer Encoder for Audio-Visual Emotion Recognition
    Hsu, Jia-Hao
    Wu, Chung-Hsien
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2023, 14 (04) : 3231 - 3243
  • [33] Audio-Visual Recognition System in Compression Domain
    Wong, Yee Wan
    Seng, Kah Phooi
    Ang, Li-Minn
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2011, 21 (05) : 637 - 646
  • [34] Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition
    Takashima, Akihiko
    Masumura, Ryo
    Ando, Atsushi
    Yamazaki, Yoshihiro
    Uchida, Mihiro
    Orihashi, Shota
    INTERSPEECH 2022, 2022, : 4740 - 4744
  • [35] Fusion of deep learning features with mixture of brain emotional learning for audio-visual emotion recognition
    Farhoudi, Zeinab
    Setayeshi, Saeed
    SPEECH COMMUNICATION, 2021, 127 : 92 - 103
  • [36] An Efficient Reliability Estimation Technique For Audio-Visual Person Identification
    Alam, Mohammad Rafiqul
    Bennamoun, Mohammed
    Togneri, Roberto
    Sohel, Ferdous
    PROCEEDINGS OF THE 2013 IEEE 8TH CONFERENCE ON INDUSTRIAL ELECTRONICS AND APPLICATIONS (ICIEA), 2013, : 1631 - 1635
  • [37] Multimodal Continuous Emotion Recognition with Data Augmentation Using Recurrent Neural Networks
    Huang, Jian
    Li, Ya
    Tao, Jianhua
    Lian, Zheng
    Niu, Mingyue
    Yang, Minghao
    PROCEEDINGS OF THE 2018 AUDIO/VISUAL EMOTION CHALLENGE AND WORKSHOP (AVEC'18), 2018, : 57 - 64
  • [38] Biometric Russian Audio-Visual Extended MASKS (BRAVE-MASKS) Corpus: Multimodal Mask Type Recognition Task
    Markitantov, Maxim
    Ryumina, Elena
    Ryumin, Dmitry
    Karpov, Alexey
    INTERSPEECH 2022, 2022, : 1756 - 1760
  • [39] AUDIO-VISUAL PERSON RECOGNITION IN MULTIMEDIA DATA FROM THE IARPA JANUS PROGRAM
    Sell, Gregory
    Duh, Kevin
    Snyder, David
    Etter, Dave
    Garcia-Romero, Daniel
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 3031 - 3035
  • [40] Multimodal Dance Generation Networks Based on Audio-Visual Analysis
    Duan, Lijuan
    Xu, Xiao
    En, Qing
    INTERNATIONAL JOURNAL OF MULTIMEDIA DATA ENGINEERING & MANAGEMENT, 2021, 12 (01) : 17 - 32