Harmonic-Net: Fundamental Frequency and Speech Rate Controllable Fast Neural Vocoder

Cited by: 2
Authors
Matsubara, Keisuke [1 ]
Okamoto, Takuma [2 ]
Takashima, Ryoichi [1 ]
Takiguchi, Tetsuya [1 ]
Toda, Tomoki [3 ]
Kawai, Hisashi [2 ]
Affiliations
[1] Kobe Univ, Grad Sch Syst Informat, Kobe 6578501, Japan
[2] Natl Inst Informat & Commun Technol, Kyoto 6190289, Japan
[3] Nagoya Univ, Informat Technol Ctr, Nagoya 4648601, Japan
Keywords
Vocoders; Generators; Harmonic analysis; Convolution; Real-time systems; Acoustics; Training; Fundamental frequency control; neural vocoder; speech-rate conversion; speech synthesis; GENERATION; NETWORKS; WAVENET; LPCNET; MODEL
DOI
10.1109/TASLP.2023.3275032
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
There is a need to improve the synthesis quality of HiFi-GAN-based real-time neural speech waveform generative models on CPUs while preserving the controllability of fundamental frequency (f_o) and speech rate (SR). For this purpose, we propose Harmonic-Net and Harmonic-Net+, which introduce two extended functions into the HiFi-GAN generator. The first extension is a downsampling network, named the excitation signal network, that hierarchically receives multi-channel excitation signals corresponding to f_o. The second extension is the layerwise pitch-dependent dilated convolutional network (LW-PDCNN), which can flexibly change its receptive fields depending on the input f_o to handle large fluctuations in f_o for the upsampling-based HiFi-GAN generator. The proposed explicit input of excitation signals and LW-PDCNNs corresponding to f_o are expected to realize high-quality synthesis for the normal and f_o-conversion conditions and for the SR-conversion condition. The results of experiments for unseen speaker synthesis, full-band singing voice synthesis, and text-to-speech synthesis show that the proposed method with harmonic waves corresponding to f_o can achieve higher synthesis quality than conventional methods in all (i.e., normal, f_o-conversion, and SR-conversion) conditions.
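The abstract describes feeding the generator multi-channel harmonic excitation signals derived from the f_o contour. As an illustration only (not the authors' implementation; sampling rate, harmonic count, and the voiced/unvoiced handling below are assumptions), such signals can be sketched as sinusoids at integer multiples of f_o, with phase obtained by integrating the instantaneous frequency:

```python
import numpy as np

def harmonic_excitation(f0, sr=24000, n_harmonics=8):
    """Sketch of multi-channel harmonic excitation from a sample-level f0 contour.

    f0: per-sample fundamental frequency in Hz (0 marks unvoiced samples).
    Returns an array of shape (n_harmonics, len(f0)), where channel k
    holds a sinusoid at (k+1) * f0, zeroed in unvoiced regions.
    """
    f0 = np.asarray(f0, dtype=np.float64)
    voiced = f0 > 0
    # Integrate instantaneous frequency (cumulative sum) to obtain phase.
    phase = 2.0 * np.pi * np.cumsum(f0) / sr
    channels = [np.sin(k * phase) * voiced for k in range(1, n_harmonics + 1)]
    return np.stack(channels)

# Example: 100 ms of a constant 200 Hz contour at 24 kHz.
sig = harmonic_excitation(np.full(2400, 200.0))
print(sig.shape)  # (8, 2400)
```

In a real system the harmonic count would be limited so that k * f_o stays below the Nyquist frequency; this sketch omits that check for brevity.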
Pages: 1902-1915
Page count: 14