FA-GAN: Artifacts-free and Phase-aware High-fidelity GAN-based Vocoder

Cited by: 0
Authors
Shen, Rubing [2 ]
Ren, Yanzhen [1 ,2 ]
Sung, Zongkun [2 ]
Affiliations
[1] Minist Educ, Key Lab Aerosp Informat Secur & Trusted Comp, Beijing, Peoples R China
[2] Wuhan Univ, Sch Cyber Sci & Engn, Wuhan, Peoples R China
Source
INTERSPEECH 2024 | 2024
Keywords
Speech synthesis; generative adversarial networks; spectral artifacts; frequency domain
DOI
10.21437/Interspeech.2024-380
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Generative adversarial network (GAN) based vocoders have attracted significant attention in speech synthesis for their high quality and fast inference speed. However, their outputs still exhibit many noticeable spectral artifacts, which degrade the quality of the synthesized speech. In this work, we present FA-GAN, a novel GAN-based vocoder designed for few artifacts and high fidelity. To suppress the aliasing artifacts that non-ideal upsampling layers introduce into high-frequency components, we introduce an anti-aliased twin deconvolution module in the generator. To alleviate blurring artifacts and enrich the reconstruction of spectral details, we propose a novel fine-grained multi-resolution real and imaginary loss that assists in modeling phase information. Experimental results show that FA-GAN outperforms the compared approaches in improving audio quality and alleviating spectral artifacts, and exhibits superior performance on unseen speakers.
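The abstract's "multi-resolution real and imaginary loss" is not specified in this record. As a rough illustration only, the sketch below compares the real and imaginary parts of STFTs at several resolutions with an L1 distance; the resolution set, the Hann window, the naive framing, and the L1 choice are all assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def stft(x, n_fft, hop):
    """Naive STFT: Hann-windowed frames -> complex half-spectrum."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=-1)

def real_imag_loss(y_hat, y,
                   resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Sum L1 distances between real and imaginary STFT parts over
    several (n_fft, hop) resolutions -- a guessed form of the loss."""
    total = 0.0
    for n_fft, hop in resolutions:
        s_hat, s = stft(y_hat, n_fft, hop), stft(y, n_fft, hop)
        total += np.abs(s_hat.real - s.real).mean()  # real-part term
        total += np.abs(s_hat.imag - s.imag).mean()  # imaginary-part term
    return total
```

Because the loss acts on the complex spectrum directly rather than on magnitudes, it is sensitive to phase errors between the generated and reference waveforms, which is the property the abstract attributes to the proposed loss.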
Pages: 3884-3888
Number of pages: 5