WHISPERED AND LOMBARD NEURAL SPEECH SYNTHESIS

Cited by: 8
Authors
Hu, Qiong [1]
Bleisch, Tobias [1]
Petkov, Petko [1]
Raitio, Tuomo [1]
Marchi, Erik [1]
Lakshminarasimhan, Varun [1]
Affiliations
[1] Apple Inc, Cupertino, CA 95014 USA
Source
2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021
Keywords
speech synthesis; speaker adaptation; multi-speaker training; Lombard speech; whisper speech; TEXT-TO-SPEECH
DOI
10.1109/SLT48900.2021.9383454
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
It is desirable for a text-to-speech system to take into account the environment where synthetic speech is presented, and to provide appropriate context-dependent output to the user. In this paper, we present and compare various approaches for generating different speaking styles, namely normal, Lombard, and whisper speech, using only limited data. The following systems are proposed and assessed: 1) Pre-training and fine-tuning a model for each style. 2) Lombard and whisper speech conversion through a signal-processing-based approach. 3) Multi-style generation using a single model based on a speaker verification model. Our mean opinion score and AB preference listening tests show that 1) we can generate high-quality speech through the pre-training/fine-tuning approach for all speaking styles. 2) Although our speaker verification (SV) model is not explicitly trained to discriminate between speaking styles, and no Lombard or whisper speech is used to pre-train this system, the SV model can be used as a style encoder to generate different style embeddings as input to the Tacotron system. We also show that the resulting synthetic Lombard speech has a significant positive impact on intelligibility gain.
Pages: 454-461
Page count: 8