Harmonic-Net: Fundamental Frequency and Speech Rate Controllable Fast Neural Vocoder

Cited by: 2
Authors
Matsubara, Keisuke [1 ]
Okamoto, Takuma [2 ]
Takashima, Ryoichi [1 ]
Takiguchi, Tetsuya [1 ]
Toda, Tomoki [3 ]
Kawai, Hisashi [2 ]
Affiliations
[1] Kobe Univ, Grad Sch Syst Informat, Kobe 6578501, Japan
[2] Natl Inst Informat & Commun Technol, Kyoto 6190289, Japan
[3] Nagoya Univ, Informat Technol Ctr, Nagoya 4648601, Japan
Keywords
Vocoders; Generators; Harmonic analysis; Convolution; Real-time systems; Acoustics; Training; Fundamental frequency control; neural vocoder; speech-rate conversion; speech synthesis; GENERATION; NETWORKS; WAVENET; LPCNET; MODEL
DOI
10.1109/TASLP.2023.3275032
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
There is a need to improve the synthesis quality of HiFi-GAN-based real-time neural speech waveform generative models on CPUs while preserving the controllability of the fundamental frequency (f_o) and speech rate (SR). For this purpose, we propose Harmonic-Net and Harmonic-Net+, which introduce two extensions into the HiFi-GAN generator. The first extension is a downsampling network, named the excitation signal network, that hierarchically receives multi-channel excitation signals corresponding to f_o. The second extension is the layerwise pitch-dependent dilated convolutional network (LW-PDCNN), which can flexibly change its receptive field depending on the input f_o, allowing the upsampling-based HiFi-GAN generator to handle large fluctuations in f_o. The explicit input of excitation signals and the f_o-dependent LW-PDCNNs are expected to realize high-quality synthesis under the normal and f_o-conversion conditions as well as the SR-conversion condition. The results of experiments on unseen-speaker synthesis, full-band singing voice synthesis, and text-to-speech synthesis show that the proposed method with harmonic waves corresponding to f_o achieves higher synthesis quality than conventional methods in all (i.e., normal, f_o-conversion, and SR-conversion) conditions.
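The excitation input described above can be made concrete with a short sketch. Below is a minimal illustration (not the authors' implementation; the function name and parameter values such as hop_length=256 and num_harmonics=8 are hypothetical) of generating multi-channel harmonic excitation signals from a frame-level f_o contour, the kind of f_o-aligned input the excitation signal network hierarchically receives.

    # Minimal sketch, assuming NumPy; not the paper's code.
    import numpy as np

    def harmonic_excitation(f0_frames, hop_length=256, sample_rate=24000,
                            num_harmonics=8):
        """Sine waves at integer multiples of f_o, shape (num_harmonics, T).
        Unvoiced frames (f_o == 0) produce silence in every channel."""
        # Upsample the frame-level f_o contour to sample level (zero-order hold).
        f0 = np.repeat(np.asarray(f0_frames, dtype=np.float64), hop_length)
        voiced = (f0 > 0).astype(np.float64)
        # Instantaneous phase = cumulative sum of per-sample angular increments,
        # which keeps each harmonic phase-coherent under time-varying f_o.
        phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)
        return np.stack([np.sin((k + 1) * phase) * voiced
                         for k in range(num_harmonics)])

    # Example: a 100-frame contour rising from 100 Hz to 200 Hz.
    exc = harmonic_excitation(np.linspace(100.0, 200.0, num=100))
    print(exc.shape)  # (8, 25600)

Scaling the f_o contour changes pitch and stretching its length changes speech rate, which is why conditioning the generator on such excitation signals preserves f_o and SR controllability; the LW-PDCNN complements this by adapting its dilation, and hence its receptive field, to the input f_o.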
Pages: 1902-1915
Number of pages: 14