MELLOTRON: MULTISPEAKER EXPRESSIVE VOICE SYNTHESIS BY CONDITIONING ON RHYTHM, PITCH AND GLOBAL STYLE TOKENS

Cited by: 0

Authors:

Valle, Rafael [1]

Li, Jason [1]

Prenger, Ryan [1]

Catanzaro, Bryan [1]

Affiliations:

[1] NVIDIA Corp, Santa Clara, CA 95051 USA

Source:

2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020

Keywords:

Text-to-Speech Synthesis; Singing Voice Synthesis; Style Transfer; Deep learning;

DOI:

10.1109/icassp40776.2020.9054556

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Mellotron is a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data. By explicitly conditioning on rhythm and continuous pitch contours from an audio signal or music score, Mellotron is able to generate speech in a variety of styles ranging from read speech to expressive speech, from slow drawls to rap and from monotonous voice to singing voice. Unlike other methods, we train Mellotron using only read speech data without alignments between text and audio. We evaluate our models using the LJSpeech and LibriTTS datasets. We provide F0 Frame Errors and synthesized samples that include style transfer from other speakers, singers and styles not seen during training, procedural manipulation of rhythm and pitch and choir synthesis.

Citations

Pages: 6189-6193

Page count: 5

21 references in total

[1] Chiu CC, 2018, 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4774, DOI 10.1109/ICASSP.2018.8462105
[2] Chu W, 2009, INT CONF ACOUST SPEE, P3969, DOI 10.1109/ICASSP.2009.4960497
[3] de Cheveigné A, Kawahara H. YIN, a fundamental frequency estimator for speech and music [J]. JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2002, 111 (04): 1917-1930
[4] Gibiansky A., 2017, P ANN C NEUR INF PRO, P2962
[5] Good Michael., 2001, VIRTUAL SCORE REPRES, P113
[6] Ito Keith, 2017, LJ SPEECH DATASET
[7] Kingma DP., 2017, A method for stochastic optimization, DOI 10.48550/ARXIV.1412.6980
[8] Lee J., 2019, ARXIV190801919
[9] Li Jason, Lavrukhin Vitaly, Ginsburg Boris, Leary Ryan, Kuchaiev Oleksii, Cohen Jonathan M., Nguyen Huyen, Gadde Ravi Teja. Jasper: An End-to-End Convolutional Neural Acoustic Model [J]. INTERSPEECH 2019, 2019: 71-75
[10] McAuliffe Michael, Socolof Michaela, Mihuc Sarah, Wagner Michael, Sonderegger Morgan. Montreal Forced Aligner: trainable text-speech alignment using Kaldi [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017: 498-502
