A multimodal dynamical variational autoencoder for audiovisual speech representation learning

Cited by: 4
Authors
Sadok, Samir [1]
Leglaive, Simon [1]
Girin, Laurent [2]
Alameda-Pineda, Xavier [3]
Seguier, Renaud [1]
Affiliations
[1] CentraleSupelec, UMR CNRS 6164, IETR, Gif Sur Yvette, France
[2] Univ Grenoble Alpes, CNRS, Grenoble INP, GIPSA Lab, Grenoble, France
[3] Univ Grenoble Alpes, Inria, LJK, CNRS, Grenoble, France
Funding
EU Horizon 2020;
Keywords
Deep generative modeling; Disentangled representation learning; Variational autoencoder; Multimodal and dynamical data; Audiovisual speech processing;
DOI
10.1016/j.neunet.2024.106120
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
High-dimensional data such as natural images or speech signals exhibit some form of regularity, preventing their dimensions from varying independently. This suggests that there exists a lower-dimensional latent representation from which the high-dimensional observed data were generated. Uncovering the hidden explanatory features of complex data is the goal of representation learning, and deep latent variable generative models have emerged as promising unsupervised approaches. In particular, the variational autoencoder (VAE), which is equipped with both a generative and an inference model, allows for the analysis, transformation, and generation of various types of data. Over the past few years, the VAE has been extended to deal with data that are either multimodal or dynamical (i.e., sequential). In this paper, we present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audiovisual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality. A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence. The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two stages. In the first stage, a vector-quantized VAE (VQ-VAE) is learned independently for each modality, without temporal modeling. The second stage consists of learning the MDVAE model on the intermediate representation of the VQ-VAEs before quantization. The disentanglement between static versus dynamical and modality-specific versus modality-common information occurs during this second training stage. Extensive experiments are conducted to investigate how audiovisual speech latent factors are encoded in the latent space of MDVAE. These experiments include manipulating audiovisual speech, audiovisual facial image denoising, and audiovisual speech emotion recognition. The results show that MDVAE effectively combines the audio and visual information in its latent space. They also show that the learned static representation of audiovisual speech can be used for emotion recognition with little labeled data, and with better accuracy compared with unimodal baselines and a state-of-the-art supervised model based on an audiovisual transformer architecture.
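To make the latent-space structure described in the abstract more concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of how the three kinds of latent variables could be wired up in PyTorch: a per-sequence static latent, a per-frame latent shared across modalities, and per-frame modality-specific latents, all inferred from the pre-quantization VQ-VAE feature sequences of each modality. All module choices (GRU sequence encoders, Gaussian reparameterization, linear decoders) and all dimensions are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of an MDVAE-style latent factorization (assumes PyTorch).
# Inputs stand in for pre-quantization VQ-VAE feature sequences per modality.
import torch
import torch.nn as nn

class MDVAESketch(nn.Module):
    def __init__(self, dim_a=64, dim_v=64, dim_w=32, dim_zav=16, dim_za=8, dim_zv=8):
        super().__init__()
        # Sequence encoders for audio and visual feature streams
        self.rnn_a = nn.GRU(dim_a, 64, batch_first=True)
        self.rnn_v = nn.GRU(dim_v, 64, batch_first=True)
        # Static latent w: one vector per sequence, from pooled audiovisual features
        self.to_w = nn.Linear(128, 2 * dim_w)
        # Shared dynamical latent z_av(t): one vector per frame, from both modalities
        self.to_zav = nn.Linear(128, 2 * dim_zav)
        # Modality-specific dynamical latents z_a(t) and z_v(t)
        self.to_za = nn.Linear(64, 2 * dim_za)
        self.to_zv = nn.Linear(64, 2 * dim_zv)
        # Decoders mapping concatenated latents back to each modality's features
        self.dec_a = nn.Linear(dim_w + dim_zav + dim_za, dim_a)
        self.dec_v = nn.Linear(dim_w + dim_zav + dim_zv, dim_v)

    @staticmethod
    def reparameterize(stats):
        # Split predicted Gaussian parameters and sample with the usual trick
        mean, logvar = stats.chunk(2, dim=-1)
        return mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)

    def forward(self, feat_a, feat_v):
        # feat_a, feat_v: (batch, time, dim) pre-quantization VQ-VAE features
        h_a, _ = self.rnn_a(feat_a)
        h_v, _ = self.rnn_v(feat_v)
        h_av = torch.cat([h_a, h_v], dim=-1)
        w = self.reparameterize(self.to_w(h_av.mean(dim=1)))   # static, per sequence
        z_av = self.reparameterize(self.to_zav(h_av))          # shared, per frame
        z_a = self.reparameterize(self.to_za(h_a))             # audio-specific, per frame
        z_v = self.reparameterize(self.to_zv(h_v))             # visual-specific, per frame
        w_seq = w.unsqueeze(1).expand(-1, feat_a.size(1), -1)
        rec_a = self.dec_a(torch.cat([w_seq, z_av, z_a], dim=-1))
        rec_v = self.dec_v(torch.cat([w_seq, z_av, z_v], dim=-1))
        return rec_a, rec_v

# Usage on random tensors standing in for two 50-frame audiovisual sequences
model = MDVAESketch()
rec_a, rec_v = model(torch.randn(2, 50, 64), torch.randn(2, 50, 64))
```

In this sketch, separating the per-sequence latent from the per-frame shared and modality-specific latents is what would allow the kinds of manipulations mentioned in the abstract (e.g., changing time-invariant attributes of a sequence while keeping its shared dynamics); the training objective and exact inference structure of the actual MDVAE are not reproduced here.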
Pages: 16