ON-THE-FLY DATA AUGMENTATION FOR TEXT-TO-SPEECH STYLE TRANSFER

Cited by: 2

Authors:

Chung, Raymond [1,2]

Mak, Brian [1]

Affiliations:

[1] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Peoples R China

[2] Logist & Supply Chain MultiTech R&D Ctr, Pok Fu Lam, Hong Kong, Peoples R China

Source:

2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU) | 2021

Keywords:

text-to-speech; neural speech synthesis; scenario-based speech synthesis; newscasting speech; story-telling speech; public speaking speech;

DOI:

10.1109/ASRU51503.2021.9688074

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Recent advanced text-to-speech (TTS) systems synthesize natural speech. However, in many applications, it is desirable to synthesize utterances in a specific style. In this paper, we investigate synthesizing audio in three styles (newscasting, public speaking, and storytelling) for a speaker who provides only neutral speech data. First, a considerable amount of speech data was collected from the neutral speaker, and small amounts of speech in the target styles were collected from other speakers such that no speaker spoke in more than one style. All the data were used to train a basic multi-style multi-speaker TTS model. Second, augmented audio was created on-the-fly with the latest TTS model during its training and was used to further train the TTS model. Specifically, augmented data were created by 'forcing' a speaker to imitate the styled speech of the three other speakers by requiring their attention alignment matrices to be as similar as possible. Objective evaluation of the rhythm and pitch profile of the synthesized speech shows that the TTS model trained with our proposed data augmentation successfully transfers speech styles in these aspects. Subjective ABX evaluation also shows that styled speech synthesized by our proposed method is overwhelmingly preferred over that from a baseline TTS model, by 40-60%.


Pages: 634-641

Page count: 8

