Multi-Voice Singing Synthesis From Lyrics

Cited by: 2
Authors
Resna, S. [1]
Rajan, Rajeev [2]
Affiliations
[1] Tata Elxsi, MultiMedia & Commun Vert, Technopk, Thiruvananthapuram, Kerala, India
[2] APJ Abdul Kalam Technol Univ, Dept Elect & Commun Engn, Coll Engn, Thiruvananthapuram, Kerala, India
Keywords
Multi-speaker; Text-to-singing conversion; Singing voice synthesis; Phonetic quality
DOI
10.1007/s00034-022-02122-3
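Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronic and communication technology]
Discipline classification codes
0808; 0809
Abstract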
This paper proposes a multi-voice singing synthesis framework that converts lyrics to their sung version in a target speaker's voice. It consists of three blocks: a text-to-speech (TTS) module, a speech-to-singing (STS) module, and an intelligibility enhancement module. In the front end, a TTS converter generates synthesized speech from the lyrics in the target speaker's voice. The STS module then synthesizes a sung version in the target melody through an encoder-decoder model. Finally, an intelligibility enhancement module based on an audio style transfer scheme improves phonetic intelligibility. The proposed system is evaluated on the LibriSpeech and NUS-48E corpora using both subjective and objective measures. The model is compared with a state-of-the-art multi-voice singing synthesis model based on a generative adversarial network (GAN). The study shows that the proposed model performs on par with the baseline model without requiring any phoneme annotations.
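The three-block structure described in the abstract can be sketched as a simple composition of stages. This is only an illustrative outline: the `Audio` container, the function names, and the stub bodies are all assumptions introduced here, not the authors' implementation, and no real synthesis is performed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Audio:
    """Placeholder waveform container (an assumption, not a type from the paper)."""
    samples: List[float]
    history: List[str] = field(default_factory=list)  # stages applied so far


def text_to_speech(lyrics: str, speaker_id: str) -> Audio:
    # Hypothetical stub for the TTS front end: one silent sample per character.
    return Audio([0.0] * len(lyrics), [f"tts:{speaker_id}"])


def speech_to_singing(speech: Audio, melody_hz: List[float]) -> Audio:
    # Hypothetical stub for the encoder-decoder STS module: it only records
    # that a target melody was supplied; no pitch conversion happens here.
    return Audio(list(speech.samples), speech.history + [f"sts:{len(melody_hz)} notes"])


def enhance_intelligibility(sung: Audio) -> Audio:
    # Hypothetical stub for the style-transfer-based enhancement module.
    return Audio(list(sung.samples), sung.history + ["enhance"])


def synthesize_singing(lyrics: str, speaker_id: str, melody_hz: List[float]) -> Audio:
    # Chain the three blocks in the order given in the abstract.
    speech = text_to_speech(lyrics, speaker_id)
    sung = speech_to_singing(speech, melody_hz)
    return enhance_intelligibility(sung)


result = synthesize_singing("la la", "speaker_a", [220.0, 246.9, 261.6])
print(result.history)
```

Keeping each block behind its own function mirrors the modular description in the abstract, so any single stage could in principle be swapped out without touching the others.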
Pages: 307-321
Page count: 15