MELLOTRON: MULTISPEAKER EXPRESSIVE VOICE SYNTHESIS BY CONDITIONING ON RHYTHM, PITCH AND GLOBAL STYLE TOKENS

Cited: 0
Authors
Valle, Rafael [1 ]
Li, Jason [1 ]
Prenger, Ryan [1 ]
Catanzaro, Bryan [1 ]
Affiliations
[1] NVIDIA Corp, Santa Clara, CA 95051 USA
Source
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020
Keywords
Text-to-Speech Synthesis; Singing Voice Synthesis; Style Transfer; Deep learning;
DOI
10.1109/icassp40776.2020.9054556
CLC Classification
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
Mellotron is a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data. By explicitly conditioning on rhythm and continuous pitch contours from an audio signal or music score, Mellotron is able to generate speech in a variety of styles ranging from read speech to expressive speech, from slow drawls to rap and from monotonous voice to singing voice. Unlike other methods, we train Mellotron using only read speech data without alignments between text and audio. We evaluate our models using the LJSpeech and LibriTTS datasets. We provide F0 Frame Errors and synthesized samples that include style transfer from other speakers, singers and styles not seen during training, procedural manipulation of rhythm and pitch and choir synthesis.
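Mellotron conditions generation on a continuous pitch contour extracted from a reference audio signal (the paper uses the YIN estimator, reference [3] below). As a rough illustration of what such a contour is, here is a minimal sketch of frame-wise autocorrelation-based F0 extraction in NumPy; the function name and parameters are illustrative and not taken from the Mellotron codebase, and a simple autocorrelation peak stands in for the full YIN difference function:

```python
import numpy as np

def f0_contour(signal, sr=22050, frame_len=1024, hop=256, fmin=80.0, fmax=800.0):
    """Estimate a frame-wise F0 contour via autocorrelation (simplified, YIN-like)."""
    lag_min = int(sr / fmax)  # smallest lag searched (highest admissible pitch)
    lag_max = int(sr / fmin)  # largest lag searched (lowest admissible pitch)
    f0 = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len]
        frame = frame - frame.mean()
        # Autocorrelation for non-negative lags 0 .. frame_len-1.
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        ac = ac / (ac[0] + 1e-9)  # normalize so lag 0 has value 1
        lag = lag_min + np.argmax(ac[lag_min:lag_max])
        f0.append(sr / lag)
    return np.array(f0)

# Usage: a pure 220 Hz sine should yield a contour near 220 Hz.
sr = 22050
t = np.arange(sr) / sr
contour = f0_contour(np.sin(2 * np.pi * 220.0 * t), sr)
```

A real front end would additionally mark unvoiced frames and smooth octave errors, which is why the paper relies on YIN rather than a raw autocorrelation peak.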
Pages: 6189–6193
Page count: 5
References
21 in total
  • [1] Chiu C. C. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018: 4774. DOI 10.1109/ICASSP.2018.8462105
  • [2] Chu W. International Conference on Acoustics, Speech and Signal Processing, 2009: 3969. DOI 10.1109/ICASSP.2009.4960497
  • [3] de Cheveigné A., Kawahara H. YIN, a fundamental frequency estimator for speech and music. Journal of the Acoustical Society of America, 2002, 111(4): 1917–1930.
  • [4] Gibiansky A. Proceedings of the Annual Conference on Neural Information Processing Systems, 2017: 2962.
  • [5] Good M. The Virtual Score: Representation, Retrieval, Restoration, 2001: 113.
  • [6] Ito K. The LJ Speech Dataset, 2017.
  • [7] Kingma D. P. Adam: A Method for Stochastic Optimization, 2017. DOI 10.48550/arXiv.1412.6980
  • [8] Lee J. arXiv:1908.01919, 2019.
  • [9] Li J., Lavrukhin V., Ginsburg B., Leary R., Kuchaiev O., Cohen J. M., Nguyen H., Gadde R. T. Jasper: An End-to-End Convolutional Neural Acoustic Model. INTERSPEECH 2019: 71–75.
  • [10] McAuliffe M., Socolof M., Mihuc S., Wagner M., Sonderegger M. Montreal Forced Aligner: trainable text-speech alignment using Kaldi. INTERSPEECH 2017: 498–502.