Harmonic-Net: Fundamental Frequency and Speech Rate Controllable Fast Neural Vocoder

Cited by: 2
Authors
Matsubara, Keisuke [1 ]
Okamoto, Takuma [2 ]
Takashima, Ryoichi [1 ]
Takiguchi, Tetsuya [1 ]
Toda, Tomoki [3 ]
Kawai, Hisashi [2 ]
Affiliations
[1] Kobe Univ, Grad Sch Syst Informat, Kobe 6578501, Japan
[2] Natl Inst Informat & Commun Technol, Kyoto 6190289, Japan
[3] Nagoya Univ, Informat Technol Ctr, Nagoya 4648601, Japan
Keywords
Vocoders; Generators; Harmonic analysis; Convolution; Real-time systems; Acoustics; Training; Fundamental frequency control; neural vocoder; speech-rate conversion; speech synthesis; GENERATION; NETWORKS; WAVENET; LPCNET; MODEL
DOI
10.1109/TASLP.2023.3275032
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
There is a need to improve the synthesis quality of HiFi-GAN-based real-time neural speech waveform generative models on CPUs while preserving the controllability of the fundamental frequency (f_o) and speech rate (SR). For this purpose, we propose Harmonic-Net and Harmonic-Net+, which introduce two extensions into the HiFi-GAN generator. The first extension is a downsampling network, named the excitation signal network, that hierarchically receives multi-channel excitation signals corresponding to f_o. The second extension is the layerwise pitch-dependent dilated convolutional network (LW-PDCNN), which can flexibly change its receptive field depending on the input f_o, allowing the upsampling-based HiFi-GAN generator to handle large fluctuations in f_o. The explicit input of excitation signals and the f_o-dependent LW-PDCNNs are expected to realize high-quality synthesis under the normal and f_o-conversion conditions as well as the SR-conversion condition. The results of experiments on unseen-speaker synthesis, full-band singing voice synthesis, and text-to-speech synthesis show that the proposed method with harmonic waves corresponding to f_o achieves higher synthesis quality than conventional methods in all (i.e., normal, f_o-conversion, and SR-conversion) conditions.
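The excitation input described above can be made concrete with a short sketch. Below is a minimal illustration (not the authors' implementation; the function name and parameter values such as hop_length=256 and num_harmonics=8 are hypothetical) of generating multi-channel harmonic excitation signals from a frame-level f_o contour, the kind of f_o-aligned input the excitation signal network hierarchically receives.

    # Minimal sketch, assuming NumPy; not the paper's code.
    import numpy as np

    def harmonic_excitation(f0_frames, hop_length=256, sample_rate=24000,
                            num_harmonics=8):
        """Sine waves at integer multiples of f_o, shape (num_harmonics, T).
        Unvoiced frames (f_o == 0) produce silence in every channel."""
        # Upsample the frame-level f_o contour to sample level (zero-order hold).
        f0 = np.repeat(np.asarray(f0_frames, dtype=np.float64), hop_length)
        voiced = (f0 > 0).astype(np.float64)
        # Instantaneous phase = cumulative sum of per-sample angular increments,
        # which keeps each harmonic phase-coherent under time-varying f_o.
        phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)
        return np.stack([np.sin((k + 1) * phase) * voiced
                         for k in range(num_harmonics)])

    # Example: a 100-frame contour rising from 100 Hz to 200 Hz.
    exc = harmonic_excitation(np.linspace(100.0, 200.0, num=100))
    print(exc.shape)  # (8, 25600)

Scaling the f_o contour changes pitch and stretching its length changes speech rate, which is why conditioning the generator on such excitation signals preserves f_o and SR controllability; the LW-PDCNN complements this by adapting its dilation, and hence its receptive field, to the input f_o.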
Pages: 1902-1915
Number of pages: 14