Enhancing Sequence-to-Sequence Text-to-Speech with Morphology

Cited by: 3

Authors:

Taylor, Jason [1]

Richmond, Korin [1]

Affiliations:

[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland

Source:

INTERSPEECH 2020 | 2020

Keywords:

Speech Synthesis; Sequence-to-Sequence; Morphology; Pronunciation

DOI:

10.21437/Interspeech.2020-1547

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification code:

100104; 100213

Abstract:

Neural sequence-to-sequence (S2S) modelling encodes a single, unified representation for each input sequence. When used for text-to-speech synthesis (TTS), such representations must embed ambiguities between English spelling and pronunciation. For example, in pothole and there the character sequence th sounds different. This can be problematic when predicting pronunciation directly from letters. We posit pronunciation becomes easier to predict when letters are grouped into subword units like morphemes (e.g. a boundary lies between t and h in pothole but not there). Moreover, morphological boundaries can reduce the total number of, and increase the counts of, seen unit subsequences. Accordingly, we test here the effect of augmenting input sequences of letters with morphological boundaries. We find morphological boundaries substantially lower the Word and Phone Error Rates (WER and PER) for a Bi-LSTM performing G2P on one hand, and also increase the naturalness scores of Tacotrons performing TTS in a MUSHRA listening test on the other. The improvements to TTS quality are such that grapheme input augmented with morphological boundaries outperforms phone input without boundaries. Since morphological segmentation may be predicted with high accuracy, we highlight this simple pre-processing step has important potential for S2S modelling in TTS.
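The input augmentation described in the abstract can be illustrated with a minimal sketch. It assumes that a morpheme segmentation is already available for each word (supplied by hand here) and that a single boundary symbol is inserted between morphemes. The symbol `|` and the function name `augment_with_boundaries` are hypothetical choices for illustration; this record does not state which token or tooling the authors used.

```python
from typing import List

# Hypothetical boundary symbol; the token actually used in the paper is not given here.
BOUNDARY = "|"


def augment_with_boundaries(morphemes: List[str], boundary: str = BOUNDARY) -> List[str]:
    """Flatten a morpheme segmentation into a letter sequence,
    inserting a boundary token between consecutive morphemes."""
    tokens: List[str] = []
    for i, morpheme in enumerate(morphemes):
        if i > 0:
            tokens.append(boundary)
        tokens.extend(morpheme)  # one token per letter
    return tokens


# Hand-written segmentations for the two example words from the abstract.
pothole = augment_with_boundaries(["pot", "hole"])
there = augment_with_boundaries(["there"])

print(pothole)  # ['p', 'o', 't', '|', 'h', 'o', 'l', 'e']
print(there)    # ['t', 'h', 'e', 'r', 'e']
```

In this sketch the letters t and h are separated by a boundary token in pothole but remain adjacent in there, so a sequence-to-sequence encoder would see two different contexts for the same character pair.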


Pages: 1738-1742

Page count: 5
