AUTOREGRESSIVE VARIATIONAL AUTOENCODER WITH A HIDDEN SEMI-MARKOV MODEL-BASED STRUCTURED ATTENTION FOR SPEECH SYNTHESIS

Cited by: 2
Authors
Fujimoto, Takato [1]
Hashimoto, Kei [1]
Nankaku, Yoshihiko [1]
Tokuda, Keiichi [1]
Affiliation
[1] Nagoya Inst Technol, Nagoya, Aichi, Japan
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
speech synthesis; variational autoencoder; autoregressive model; attention mechanism; hidden semi-Markov model
DOI
10.1109/ICASSP43922.2022.9746158
Chinese Library Classification
O42 [Acoustics]
Subject Classification
070206; 082403
Abstract
This paper proposes an autoregressive speech synthesis model based on a variational autoencoder that incorporates a latent sequence representation of acoustic and linguistic features together with the structure of a hidden semi-Markov model (HSMM). Although autoregressive models provide efficient and accurate modeling of acoustic features, they suffer from exposure bias, i.e., the mismatch between training (teacher-forcing) and inference (free-running). To overcome this problem, we introduce an autoregressive latent variable sequence rather than autoregressive generation of observations. A latent representation of alignment using an HSMM-based structured attention mechanism enables a fully consistent training algorithm for acoustic modeling with explicit duration models. Experimental results indicate that the proposed model outperformed baselines in subjective naturalness.
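The abstract's key idea is that the autoregression runs over latent variables rather than over the generated observations, so the same latent recursion is used at training and inference time and the teacher-forcing/free-running mismatch over outputs disappears. The following is a minimal sketch of that idea only, not the paper's actual model: the transition function, dimensions, and noise scale here are hypothetical placeholders, and the real system additionally conditions on linguistic features and an HSMM-based alignment.

```python
import numpy as np

rng = np.random.default_rng(0)

def ar_latent_rollout(T, dim, W, b):
    """Roll out an autoregressive latent sequence z_t = tanh(W z_{t-1} + b) + noise.

    The recursion over z is identical whether the model is being trained or
    used for synthesis, which is the sketched analogue of avoiding exposure
    bias: no ground-truth observation is ever fed back into the loop.
    """
    z = np.zeros(dim)
    zs = []
    for _ in range(T):
        # Hypothetical transition: deterministic map plus Gaussian innovation.
        z = np.tanh(W @ z + b) + 0.1 * rng.standard_normal(dim)
        zs.append(z)
    return np.stack(zs)  # shape (T, dim)

dim = 4
W = 0.5 * rng.standard_normal((dim, dim))
b = rng.standard_normal(dim)
Z = ar_latent_rollout(10, dim, W, b)
print(Z.shape)  # (10, 4)
```

In the paper's setting a decoder would then map each z_t to an acoustic frame; since the decoder's inputs come from the latent rollout in both phases, training and inference see the same input distribution over latents.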
Pages: 7462-7466 (5 pages)