Data Augmentation for Audio-Visual Emotion Recognition with an Efficient Multimodal Conditional GAN

Cited: 37
|
Authors
Ma, Fei [1 ]
Li, Yang [1 ]
Ni, Shiguang [2 ]
Huang, Shao-Lun [1 ]
Zhang, Lin [1 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Berkeley Shenzhen Inst, Shenzhen 518055, Peoples R China
[2] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen 518055, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2022, Vol. 12, Issue 01
Funding
National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
audio-visual emotion recognition; data augmentation; multimodal conditional generative adversarial network (GAN); Hirschfeld-Gebelein-Renyi (HGR) maximal correlation; CLASSIFICATION; CONNECTION; MULTIMEDIA; NETWORKS; FEATURES; SPEECH; FUSION;
DOI
10.3390/app12010527
CLC Number
O6 [Chemistry];
Discipline Code
0703;
Abstract
Audio-visual emotion recognition identifies human emotional states by combining the audio and visual modalities, and plays an important role in intelligent human-machine interaction. With the help of deep learning, previous work has made great progress on audio-visual emotion recognition. However, deep learning methods typically require large amounts of training data. In practice, data acquisition is difficult and expensive, especially for multimodal data spanning different modalities. As a result, the training data may lie in a low-data regime that cannot be used effectively for deep learning. In addition, class imbalance may occur in the emotional data, which can further degrade recognition performance. To address these problems, we propose an efficient data augmentation framework based on a multimodal conditional generative adversarial network (GAN) for audio-visual emotion recognition. Specifically, we design generators and discriminators for the audio and visual modalities, and feed the category information to both generators as a shared input so that the GAN can generate fake data for each category. Moreover, we model the strong dependence between the audio and visual modalities in the generated data with the Hirschfeld-Gebelein-Renyi (HGR) maximal correlation, relating the modalities so that the generated data approximate the real data. The generated data are then used to augment the data manifold, and we further apply our approach to the class-imbalance problem. To the best of our knowledge, this is the first work to propose a data augmentation strategy with a multimodal conditional GAN for audio-visual emotion recognition. Experiments on three public multimodal datasets (eNTERFACE'05, RAVDESS, and CMEW) show that our multimodal conditional GAN is highly effective for data augmentation in audio-visual emotion recognition.
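The cross-modal dependence term in the abstract can be illustrated with a soft surrogate of the HGR maximal correlation commonly used for multimodal feature learning: maximize the expected inner product of the two modalities' (zero-mean) feature batches while penalizing the trace product of their covariances. This is a minimal NumPy sketch of that objective, not the paper's exact loss; the function and variable names are illustrative assumptions:

```python
import numpy as np

def soft_hgr_objective(f, g):
    """Soft surrogate of the HGR maximal correlation between two
    feature batches f (n, d) and g (n, d):
    E[f(x)^T g(y)] - 1/2 * tr(cov(f) @ cov(g)).
    Larger values indicate stronger cross-modal dependence."""
    # zero-mean both feature batches over the samples
    f = f - f.mean(axis=0, keepdims=True)
    g = g - g.mean(axis=0, keepdims=True)
    n = f.shape[0]
    # empirical cross-correlation term E[f^T g]
    cross = np.sum(f * g) / n
    # covariance penalty keeps the features from trivially blowing up
    cov_f = f.T @ f / (n - 1)
    cov_g = g.T @ g / (n - 1)
    penalty = 0.5 * np.trace(cov_f @ cov_g)
    return cross - penalty
```

In a training loop, the generated audio and visual features would be passed through small projection networks and this objective maximized alongside the GAN losses, encouraging the two generated modalities to stay as dependent as the real pairs are.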
Pages: 24
Related Papers
50 papers in total
  • [41] A System for the Semantic Multimodal Analysis of News Audio-Visual Content
    Mezaris, Vasileios
    Gidaros, Spyros
    Papadopoulos, Georgios Th.
    Kasper, Walter
    Steffen, Joerg
    Ordelman, Roeland
    Huijbregts, Marijn
    de Jong, Franciska
    Kompatsiaris, Ioannis
    Strintzis, Michael G.
    EURASIP JOURNAL ON ADVANCES IN SIGNAL PROCESSING, 2010,
  • [42] A multimodal hierarchical approach to speech emotion recognition from audio and text
    Singh, Prabhav
    Srivastava, Ridam
    Rana, K. P. S.
    Kumar, Vineet
    KNOWLEDGE-BASED SYSTEMS, 2021, 229
  • [43] Audio-Visual Biometric Recognition and Presentation Attack Detection: A Comprehensive Survey
    Mandalapu, Hareesh
    Reddy, Aravinda P. N.
    Ramachandra, Raghavendra
    Rao, Krothapalli Sreenivasa
    Mitra, Pabitra
    Prasanna, S. R. Mahadeva
    Busch, Christoph
    IEEE ACCESS, 2021, 9 : 37431 - 37455
  • [44] Audio-visual Speaker Recognition with a Cross-modal Discriminative Network
    Tao, Ruijie
    Das, Rohan Kumar
    Li, Haizhou
    INTERSPEECH 2020, 2020, : 2242 - 2246
  • [45] Audio-Visual Group Recognition Using Diffusion Maps
    Keller, Yosi
    Coifman, Ronald R.
    Lafon, Stephane
    Zucker, Steven W.
    IEEE TRANSACTIONS ON SIGNAL PROCESSING, 2010, 58 (01) : 403 - 413
  • [46] On Dynamic Stream Weighting for Audio-Visual Speech Recognition
    Estellers, Virginia
    Gurban, Mihai
    Thiran, Jean-Philippe
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2012, 20 (04): : 1145 - 1157
  • [47] Audio-visual speech recognition using deep learning
    Noda, Kuniaki
    Yamaguchi, Yuki
    Nakadai, Kazuhiro
    Okuno, Hiroshi G.
    Ogata, Tetsuya
    APPLIED INTELLIGENCE, 2015, 42 (04) : 722 - 737
  • [48] AVERFormer: End-to-end audio-visual emotion recognition transformer framework with balanced modal contributions
    Sun, Zijian
    Liu, Haoran
    Li, Haibin
    Li, Yaqian
    Zhang, Wenming
    DIGITAL SIGNAL PROCESSING, 2025, 161
  • [49] Audio-Visual Speaker Recognition for Video Broadcast News
Maison, Benoît
Neti, Chalapathy
Senior, Andrew
JOURNAL OF VLSI SIGNAL PROCESSING SYSTEMS FOR SIGNAL, IMAGE AND VIDEO TECHNOLOGY, 2001, 29 : 71 - 79
  • [50] Enhanced Speech Emotion Recognition Using Conditional-DCGAN-Based Data Augmentation
    Roh, Kyung-Min
    Lee, Seok-Pil
    APPLIED SCIENCES-BASEL, 2024, 14 (21):