NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality

Cited by: 48
Authors
Tan, Xu [1 ,2 ]
Chen, Jiawei [3 ]
Liu, Haohe [4 ]
Cong, Jian [3 ]
Zhang, Chen [3 ]
Liu, Yanqing [3 ]
Wang, Xi [3 ]
Leng, Yichong [2 ]
Yi, Yuanhao [3 ]
He, Lei [3 ]
Zhao, Sheng [3 ]
Qin, Tao [2 ]
Soong, Frank [2 ]
Liu, Tie-Yan [2 ]
Affiliations
[1] Peking Univ, Beijing 100871, Peoples R China
[2] Microsoft Res Asia, Beijing 100080, Peoples R China
[3] Microsoft Azure Speech, Beijing 100080, Peoples R China
[4] Univ Surrey, Guildford GU2 7XH, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
Recording; Vocoders; Decoding; Semiconductor device modeling; Guidelines; Upper bound; Training; Text-to-speech; speech synthesis; human-level quality; variational auto-encoder; end-to-end;
DOI
10.1109/TPAMI.2024.3356232
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text-to-speech (TTS) has made rapid progress in both academia and industry in recent years. Several questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on benchmark datasets. Specifically, we leverage a variational auto-encoder (VAE) for end-to-end text-to-waveform generation, with several key modules to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in the VAE. Experimental evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS (comparative mean opinion score) relative to human recordings at the sentence level, with a Wilcoxon signed-rank test at p >> 0.05, which demonstrates no statistically significant difference from human recordings for the first time.
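The abstract's quality criterion rests on a paired significance test over comparative listening scores. The sketch below (not the authors' evaluation code) illustrates, under assumed inputs, how a CMOS value and a Wilcoxon signed-rank p-value could be computed from per-utterance ratings on a comparative scale, where each rating is the listener's preference for the synthesized sample over the paired recording; a CMOS near zero with p well above 0.05 would indicate no statistically significant difference.

```python
# Minimal sketch (hypothetical data format, not the paper's protocol):
# per-utterance comparative ratings, synthesized minus recording, e.g. on [-3, 3].
import numpy as np
from scipy.stats import wilcoxon

def cmos_significance(ratings):
    """Return (CMOS, p-value) for paired comparative ratings."""
    ratings = np.asarray(ratings, dtype=float)
    cmos = ratings.mean()  # comparative mean opinion score
    # One-sample Wilcoxon signed-rank test against a zero median difference;
    # the default zero_method discards zero-valued differences.
    _, p_value = wilcoxon(ratings)
    return cmos, p_value

# Made-up example scores; real evaluations aggregate many listeners/utterances.
scores = [0, 1, -1, 0, 0, 1, 0, -1, 0, 0, -1, 1, 0, 0, 1]
cmos, p = cmos_significance(scores)
print(f"CMOS = {cmos:+.2f}, Wilcoxon p = {p:.3f}")
```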
Pages: 4234-4245
Number of pages: 12