MULTI-RATE ATTENTION ARCHITECTURE FOR FAST STREAMABLE TEXT-TO-SPEECH SPECTRUM MODELING

Cited by: 3
Authors
He, Qing [1 ]
Xiu, Zhiping [1 ]
Koehler, Thilo [1 ]
Wu, Jilong [1 ]
Affiliations
[1] Facebook AI, Menlo Park, CA 94025 USA
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
text-to-speech; spectrum model; attention
DOI
10.1109/ICASSP39728.2021.9414809
CLC number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Typical high-quality text-to-speech (TTS) systems today use a two-stage architecture, with a spectrum model stage that generates spectral frames and a vocoder stage that generates the actual audio. High-quality spectrum models usually incorporate an encoder-decoder architecture with self-attention or bi-directional long short-term memory (BLSTM) units. While these models can produce high-quality speech, they often incur an O(L) increase in both latency and real-time factor (RTF) with respect to the input length L. In other words, longer inputs lead to longer delays and slower synthesis, limiting their use in real-time applications. In this paper, we propose a multi-rate attention architecture that breaks the latency and RTF bottlenecks by computing a compact representation during encoding and recurrently generating the attention vector in a streaming manner during decoding. The proposed architecture achieves high audio quality (MOS of 4.31, compared to 4.48 for ground truth), low latency, and low RTF at the same time. Moreover, both the latency and the RTF of the proposed system stay constant regardless of input length, making it ideal for real-time applications.
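The key idea stated in the abstract is that the encoder compresses the length-L input into a representation whose size does not grow with L, so each streaming decoder step attends over a constant number of vectors and per-step cost is O(1) in L. Below is a minimal PyTorch sketch of that idea only; it is not the authors' implementation, and the module names, the summary size K, and the GRU-based recurrence are all illustrative assumptions.

    # Sketch: fixed-size encoder summary + constant-cost streaming decoder step.
    import torch
    import torch.nn as nn

    class FixedSizeSummaryEncoder(nn.Module):
        """Compresses a variable-length input into K summary vectors."""
        def __init__(self, d_model: int, num_summaries: int = 8):
            super().__init__()
            # Learned queries that pool the input down to K vectors.
            self.queries = nn.Parameter(torch.randn(num_summaries, d_model))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (L, d_model) -> summary: (K, d_model); K is constant in L.
            scores = self.queries @ x.T            # (K, L)
            weights = scores.softmax(dim=-1)       # attention over positions
            return weights @ x                     # (K, d_model)

    class StreamingAttentionDecoderStep(nn.Module):
        """One decoder step: attend over the fixed-size summary, then update
        a recurrent state. Per-step cost is O(K), independent of L."""
        def __init__(self, d_model: int):
            super().__init__()
            self.rnn = nn.GRUCell(2 * d_model, d_model)

        def forward(self, state, summary):
            # state: (B, d_model); summary: (K, d_model)
            scores = state @ summary.T             # (B, K)
            context = scores.softmax(-1) @ summary # (B, d_model)
            return self.rnn(torch.cat([state, context], dim=-1), state)

    # Toy usage: each decoder step does the same work for any input length.
    d = 64
    enc = FixedSizeSummaryEncoder(d)
    dec = StreamingAttentionDecoderStep(d)
    summary = enc(torch.randn(1000, d))            # L = 1000 -> (8, 64)
    state = torch.zeros(1, d)
    for _ in range(5):                             # stream frames one by one
        state = dec(state, summary)

Because the summary has a fixed shape, each decoder step performs the same amount of computation whether L is 100 or 10,000, which mirrors the constant-latency, constant-RTF property claimed in the abstract.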
Pages: 5689-5693
Number of pages: 5