Phonetic and Prosodic Information Estimation from Texts for Genuine Japanese End-to-End Text-to-Speech

Cited by: 5
Authors
Kakegawa, Naoto [1 ]
Hara, Sunao [1 ]
Abe, Masanobu [1 ]
Ijima, Yusuke [2 ]
Affiliations
[1] Okayama Univ, Grad Sch Interdisciplinary Sci & Engn Hlth Syst, Okayama, Japan
[2] NTT Corp, Tokyo, Japan
Source
INTERSPEECH 2021 | 2021
Keywords
Text-to-speech; Grapheme-to-phoneme (G2P); Attention mechanism; Transformer; Sequence-to-sequence neural networks
DOI
10.21437/Interspeech.2021-914
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
The biggest obstacle to developing end-to-end Japanese text-to-speech (TTS) systems is estimating phonetic and prosodic information (PPI) from Japanese text, for the following reasons: (1) Kanji characters in the Japanese writing system have multiple possible pronunciations, (2) there are no separation marks between words, and (3) an accent nucleus must be assigned at the appropriate position. In this paper, we propose to solve these problems with neural machine translation (NMT) based on encoder-decoder models, and we compare NMT models built on recurrent neural networks with those built on the Transformer architecture. The proposed model handles text on a token (character) basis, whereas conventional systems handle it on a word basis. To verify the potential of the proposed approach, the NMT models are trained on pairs of sentences and their PPIs, generated by a conventional Japanese TTS system from 5 million sentences. Evaluation experiments were performed using PPIs manually annotated for 5,142 sentences. The results showed that the Transformer architecture performs best, achieving 98.0% accuracy for phonetic information estimation and 95.0% accuracy for PPI estimation. Judging from these results, NMT models are promising for end-to-end Japanese TTS.
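The abstract describes a character-level encoder-decoder that translates raw Japanese text into a PPI token sequence and reports that the Transformer variant works best. The following is a minimal, hypothetical sketch of such a character-to-PPI Transformer in PyTorch; it is not the authors' implementation, and the vocabulary sizes, special-token ids, and hyperparameters are illustrative assumptions rather than values from the paper.

```python
# Minimal sketch (assumptions, not the paper's code): a character-level
# encoder-decoder mapping Japanese text to phonetic-and-prosodic-information
# (PPI) tokens, built on PyTorch's nn.Transformer.
import torch
import torch.nn as nn

PAD, BOS, EOS = 0, 1, 2  # assumed special-token ids


class CharToPPITransformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=256, nhead=4,
                 num_layers=3, dim_ff=1024, max_len=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model, padding_idx=PAD)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model, padding_idx=PAD)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned positions (an assumption)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=dim_ff, batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def _embed(self, ids, emb):
        # Add a learned positional embedding to the token embedding.
        pos = torch.arange(ids.size(1), device=ids.device).unsqueeze(0)
        return emb(ids) + self.pos_emb(pos)

    def forward(self, src_ids, tgt_ids):
        # src_ids: input characters; tgt_ids: PPI tokens shifted right (teacher forcing).
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            tgt_ids.size(1)).to(src_ids.device)
        h = self.transformer(
            self._embed(src_ids, self.src_emb),
            self._embed(tgt_ids, self.tgt_emb),
            tgt_mask=tgt_mask,
            src_key_padding_mask=(src_ids == PAD),
            tgt_key_padding_mask=(tgt_ids == PAD))
        return self.out(h)  # logits over PPI tokens (e.g. kana, accent and boundary marks)


# Toy usage: a batch of 2 "sentences"; character ids in, PPI-token logits out.
model = CharToPPITransformer(src_vocab=3000, tgt_vocab=200)
src = torch.randint(3, 3000, (2, 20))
tgt = torch.randint(3, 200, (2, 30))
logits = model(src, tgt)  # shape: (2, 30, 200)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 200), tgt.reshape(-1), ignore_index=PAD)
```

In a real system the target vocabulary would cover kana plus prosodic symbols such as accent-nucleus and phrase-boundary marks, and inference would decode autoregressively rather than with the teacher-forced targets used in this toy example.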
Pages: 126-130
Number of pages: 5