Disentanglement in a GAN for Unconditional Speech Synthesis

Cited by: 0

Authors:

Baas, Matthew [1]

Kamper, Herman [1]

Affiliations:

[1] Stellenbosch Univ, Dept Elect & Elect Engn, ZA-7602 Stellenbosch, South Africa

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2024 / Vol. 32

Keywords:

Speech synthesis; Adaptation models; Task analysis; Generators; Convolution; Generative adversarial networks; Training; Unconditional speech synthesis; generative adversarial networks; speech disentanglement;

DOI:

10.1109/TASLP.2024.3359352

Chinese Library Classification number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Can we develop a model that can synthesize realistic speech directly from a latent space, without explicit conditioning? Despite several efforts over the last decade, previous adversarial and diffusion-based approaches still struggle to achieve this, even on small-vocabulary datasets. To address this, we propose AudioStyleGAN (ASGAN), a generative adversarial network for unconditional speech synthesis tailored to learn a disentangled latent space. Building upon the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a disentangled latent vector, which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer. To successfully train ASGAN, we introduce a number of new techniques, including a modification to adaptive discriminator augmentation that probabilistically skips discriminator updates. We apply it to the small-vocabulary Google Speech Commands digits dataset, where it achieves state-of-the-art results in unconditional speech synthesis. It is also substantially faster than existing top-performing diffusion models. We confirm that ASGAN's latent space is disentangled: we demonstrate how simple linear operations in the space can be used to perform several tasks unseen during training. Specifically, we perform evaluations in voice conversion, speech enhancement, speaker verification, and keyword classification. Our work indicates that GANs are still highly competitive in the unconditional speech synthesis landscape, and that disentangled latent spaces can be used to aid generalization to unseen tasks.

Pages: 1324 - 1335

Number of pages: 12

50 items in total

[1] GAN YOU HEAR ME? RECLAIMING UNCONDITIONAL SPEECH SYNTHESIS FROM DIFFUSION MODELS
Baas, Matthew
Kamper, Herman
2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 906 - 911
[2] HiFi-GANw: Watermarked Speech Synthesis via Fine-Tuning of HiFi-GAN
Cheng, Xiangyu
Wang, Yaofei
Liu, Chang
Hu, Donghui
Su, Zhaopin
IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 2440 - 2444
[3] FEditNet++: Few-Shot Editing of Latent Semantics in GAN Spaces With Correlated Attribute Disentanglement
Yi, Ran
Hu, Teng
Xia, Mengfei
Tang, Yizhe
Liu, Yong-Jin
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (12) : 9975 - 9990
[4] Cyclic Defense GAN Against Speech Adversarial Attacks
Esmaeilpour, Mohammad
Cardinal, Patrick
Koerich, Alessandro Lameiras
IEEE SIGNAL PROCESSING LETTERS, 2021, 28 : 1769 - 1773
[5] RSD-GAN: Regularized Sobolev Defense GAN Against Speech-to-Text Adversarial Attacks
Esmaeilpour, Mohammad
Chaalia, Nourhene
Cardinal, Patrick
IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 1998 - 2002
[6] Multimodal image synthesis based on disentanglement representations of anatomical and modality specific features, learned relativistic GAN
Reaungamornrat, Sureerat
Sari, Hasan
Catana, Ciprian
Kamen, Ali
MEDICAL IMAGE ANALYSIS, 2022, 80
[7] PoT-GAN: Pose Transform GAN for Person Image Synthesis
Li, Tianjiao
Zhang, Wei
Song, Ran
Li, Zhiheng
Liu, Jun
Li, Xiaolei
Lu, Shijian
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 7677 - 7688
[8] Editable Image Generation with Consistent Unsupervised Disentanglement Based on GAN
Yang, Gaoming
Qu, Yuanjin
Fang, Xianjin
APPLIED SCIENCES-BASEL, 2022, 12 (11):
[9] Cross-View Image Synthesis From a Single Image With Progressive Parallel GAN
Zhu, Yingying
Chen, Shihai
Lu, Xiufan
Chen, Jianyong
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
[10] HiFi-GAN based Text-to-Speech Synthesis in Serbian
Suzic, Sinisa
Pekar, Darko
Secujski, Milan
Nosek, Tijana
Delic, Vlado
2022 30TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2022), 2022, : 2231 - 2235