High-Fidelity and Pitch-Controllable Neural Vocoder Based on Unified Source-Filter Networks

Cited by: 2
Authors
Yoneyama, Reo [1 ]
Wu, Yi-Chiao [1 ]
Toda, Tomoki [2 ]
Affiliations
[1] Nagoya Univ, Grad Sch Informat, Nagoya 4648601, Japan
[2] Nagoya Univ, Informat Technol Ctr, Nagoya 4648601, Japan
Funding
Japan Society for the Promotion of Science
Keywords
Vocoders; Controllability; Speech processing; Neural networks; Training; Mathematical models; Acoustics; Speech synthesis; neural vocoder; source-filter model; unified source-filter networks; WAVE-FORM GENERATION; SPEECH SYNTHESIS; MODEL;
DOI
10.1109/TASLP.2023.3313410
Chinese Library Classification
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
We introduce unified source-filter generative adversarial networks (uSFGAN), a waveform generative model conditioned on acoustic features that represents the source-filter architecture within a single generator network. Unlike previous neural source-filter models, in which parametric signal processing modules are combined with neural networks, our approach enables unified optimization of both the source excitation generation and resonance filtering parts to achieve higher sound quality. In the uSFGAN framework, several specific regularization losses are proposed to make the source excitation generation part output reasonable source excitation signals. Both objective and subjective experiments are conducted, and the results demonstrate that the proposed uSFGAN achieves sound quality comparable to HiFi-GAN in the speech reconstruction task and outperforms WORLD in the F0 transformation task. Moreover, we argue that the F0-driven mechanism and the inductive bias obtained from source-filter modeling improve robustness to F0 values unseen in training, as shown by the experimental evaluations. Audio samples are available at our demo site: https://chomeyama.github.io/PitchControllableNeuralVocoder-Demo/.
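The classical source-filter decomposition that uSFGAN learns end-to-end in a single network can be illustrated with a toy, non-neural sketch: an F0-driven sinusoidal excitation (noise in unvoiced frames) passed through a separate resonance filter. This is only a minimal illustration of the underlying signal model; the function names, hop size, and the hand-picked filter below are assumptions for the example and are not the paper's learned modules.

```python
import numpy as np

def sine_excitation(f0, sr=16000, hop=80):
    """Source stage: build an excitation signal from a frame-level F0 contour."""
    # Upsample the frame-level F0 contour to the sample rate.
    f0_up = np.repeat(np.asarray(f0, dtype=float), hop)
    # Integrate instantaneous frequency to obtain the sinusoid's phase.
    phase = 2.0 * np.pi * np.cumsum(f0_up) / sr
    voiced = f0_up > 0
    rng = np.random.default_rng(0)
    noise = 0.1 * rng.standard_normal(len(f0_up))
    # Sinusoid where voiced (f0 > 0), low-level Gaussian noise where unvoiced.
    return np.where(voiced, np.sin(phase), noise)

def resonance_filter(excitation, impulse_response):
    """Filter stage: convolve the excitation with an impulse response."""
    return np.convolve(excitation, impulse_response, mode="same")

f0 = np.array([200.0] * 10 + [0.0] * 5)  # 10 voiced frames at 200 Hz, 5 unvoiced
e = sine_excitation(f0)
h = np.hanning(64)
h /= h.sum()                             # toy low-pass filter standing in for the vocal tract
y = resonance_filter(e, h)
```

In a parametric vocoder such as WORLD, the two stages are fixed signal processing; uSFGAN instead realizes both inside one generator so that they are optimized jointly, while its regularization losses keep the excitation part producing source-like signals as in this decomposition.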
Pages: 3717-3729
Page count: 13
Related Papers
11 records in total
  • [1] A Fast High-Fidelity Source-Filter Vocoder With Lightweight Neural Modules
    Yang, Runxuan
    Peng, Yuyang
    Hu, Xiaolin
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 3362 - 3373
  • [2] Unified Source-Filter GAN: Unified Source-filter Network Based On Factorization of Quasi-Periodic Parallel WaveGAN
    Yoneyama, Reo
    Wu, Yi-Chiao
    Toda, Tomoki
    INTERSPEECH 2021, 2021, : 2187 - 2191
  • [3] Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis
    Lu, Ye-Xin
    Ai, Yang
    Ling, Zhen-Hua
    MAN-MACHINE SPEECH COMMUNICATION, NCMMSC 2022, 2023, 1765 : 68 - 80
  • [4] Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS
    Song, Kun
    Cong, Jian
    Wang, Xinsheng
    Zhang, Yongmao
    Xie, Lei
    Jiang, Ning
    Wu, Haiying
    2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2022, : 71 - 75
  • [5] Reverberation Modeling for Source-Filter-based Neural Vocoder
    Ai, Yang
    Wang, Xin
    Yamagishi, Junichi
    Ling, Zhen-Hua
    INTERSPEECH 2020, 2020, : 3560 - 3564
  • [6] High-fidelity and low-latency universal neural vocoder based on multiband WaveRNN with data-driven linear prediction for discrete waveform modeling
    Tobing, Patrick Lumban
    Toda, Tomoki
    INTERSPEECH 2021, 2021, : 2217 - 2221
  • [7] FA-GAN: Artifacts-free and Phase-aware High-fidelity GAN-based Vocoder
    Shen, Rubing
    Ren, Yanzhen
    Sung, Zongkun
    INTERSPEECH 2024, 2024, : 3884 - 3888
  • [8] Audio Dequantization for High Fidelity Audio Generation in Flow-based Neural Vocoder
    Yoon, Hyun-Wook
    Lee, Sang-Hoon
    Noh, Hyeong-Rae
    Lee, Seong-Whan
    INTERSPEECH 2020, 2020, : 3545 - 3549
  • [9] Spectral prediction method based on the transformer neural network for high-fidelity color reproduction
    Li, Huailin
    Zheng, Yingying
    Liu, Qinsen
    Sun, Bangyong
    OPTICS EXPRESS, 2024, 32 (17): : 30481 - 30499
  • [10] Convolutional Neural Network Based Denoising for Digital Image Correlation Reconstructing High-Fidelity Deformation Field
    Niu, Bangyan
    Ji, Jingjing
    2023 IEEE/ASME INTERNATIONAL CONFERENCE ON ADVANCED INTELLIGENT MECHATRONICS, AIM, 2023, : 727 - 732