Disentanglement in a GAN for Unconditional Speech Synthesis

Cited by: 0
Authors
Baas, Matthew [1]
Kamper, Herman [1]
Affiliations
[1] Stellenbosch Univ, Dept Elect & Elect Engn, ZA-7602 Stellenbosch, South Africa
Keywords
Speech synthesis; Adaptation models; Task analysis; Generators; Convolution; Generative adversarial networks; Training; Unconditional speech synthesis; Speech disentanglement
DOI
10.1109/TASLP.2024.3359352
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Can we develop a model that can synthesize realistic speech directly from a latent space, without explicit conditioning? Despite several efforts over the last decade, previous adversarial and diffusion-based approaches still struggle to achieve this, even on small-vocabulary datasets. To address this, we propose AudioStyleGAN (ASGAN), a generative adversarial network for unconditional speech synthesis tailored to learn a disentangled latent space. Building upon the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a disentangled latent vector, which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer. To successfully train ASGAN, we introduce a number of new techniques, including a modification to adaptive discriminator augmentation that probabilistically skips discriminator updates. We apply it to the small-vocabulary Google Speech Commands digits dataset, where it achieves state-of-the-art results in unconditional speech synthesis. It is also substantially faster than existing top-performing diffusion models. We confirm that ASGAN's latent space is disentangled: we demonstrate how simple linear operations in the space can be used to perform several tasks unseen during training. Specifically, we perform evaluations in voice conversion, speech enhancement, speaker verification, and keyword classification. Our work indicates that GANs are still highly competitive in the unconditional speech synthesis landscape, and that disentangled latent spaces can be used to aid generalization to unseen tasks.
Pages: 1324-1335
Page count: 12