FA-GAN: Artifacts-free and Phase-aware High-fidelity GAN-based Vocoder

Cited by: 0
Authors
Shen, Rubing [2 ]
Ren, Yanzhen [1 ,2 ]
Sung, Zongkun [2 ]
Affiliations
[1] Minist Educ, Key Lab Aerosp Informat Secur & Trusted Comp, Beijing, Peoples R China
[2] Wuhan Univ, Sch Cyber Sci & Engn, Wuhan, Peoples R China
Source
INTERSPEECH 2024 | 2024
Keywords
Speech synthesis; generative adversarial networks; spectral artifacts; frequency domain
DOI
10.21437/Interspeech.2024-380
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Generative adversarial network (GAN) based vocoders have attracted significant attention in speech synthesis for their high quality and fast inference speed. However, their outputs still exhibit many noticeable spectral artifacts, which degrade the quality of the synthesized speech. In this work, we present FA-GAN, a novel GAN-based vocoder designed for few artifacts and high fidelity. To suppress the aliasing artifacts that non-ideal upsampling layers introduce into high-frequency components, we introduce an anti-aliased twin deconvolution module in the generator. To alleviate blurring artifacts and enrich the reconstruction of spectral details, we propose a novel fine-grained multi-resolution real and imaginary loss that assists in modeling phase information. Experimental results show that FA-GAN outperforms the compared approaches in improving audio quality and alleviating spectral artifacts, and exhibits superior performance on unseen speakers.
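The abstract's "multi-resolution real and imaginary loss" is not specified in this record. As a rough illustration only, the sketch below compares the real and imaginary parts of STFTs at several resolutions with an L1 distance; the resolution set, the Hann window, the naive framing, and the L1 choice are all assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def stft(x, n_fft, hop):
    """Naive STFT: Hann-windowed frames -> complex half-spectrum."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=-1)

def real_imag_loss(y_hat, y,
                   resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Sum L1 distances between real and imaginary STFT parts over
    several (n_fft, hop) resolutions -- a guessed form of the loss."""
    total = 0.0
    for n_fft, hop in resolutions:
        s_hat, s = stft(y_hat, n_fft, hop), stft(y, n_fft, hop)
        total += np.abs(s_hat.real - s.real).mean()  # real-part term
        total += np.abs(s_hat.imag - s.imag).mean()  # imaginary-part term
    return total
```

Because the loss acts on the complex spectrum directly rather than on magnitudes, it is sensitive to phase errors between the generated and reference waveforms, which is the property the abstract attributes to the proposed loss.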
Pages: 3884-3888
Number of pages: 5