AUTOREGRESSIVE VARIATIONAL AUTOENCODER WITH A HIDDEN SEMI-MARKOV MODEL-BASED STRUCTURED ATTENTION FOR SPEECH SYNTHESIS

Cited by: 2

Authors:

Fujimoto, Takato [1]

Hashimoto, Kei [1]

Nankaku, Yoshihiko [1]

Tokuda, Keiichi [1]

Affiliation:

[1] Nagoya Institute of Technology, Nagoya, Aichi, Japan

Source:

2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022

Keywords:

speech synthesis; variational autoencoder; autoregressive model; attention mechanism; hidden semi-Markov model

DOI:

10.1109/ICASSP43922.2022.9746158

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

This paper proposes an autoregressive speech synthesis model based on the variational autoencoder that incorporates a latent sequence representation for acoustic and linguistic features and the structure of a hidden semi-Markov model (HSMM). Although autoregressive models can provide efficient and accurate modeling of acoustic features, they suffer from exposure bias, i.e., the mismatch between training (teacher forcing) and inference (free running). To overcome this problem, we introduce an autoregressive latent variable sequence rather than generating observations autoregressively. A latent representation of alignment based on an HSMM-based structured attention mechanism enables the use of a completely consistent training algorithm for acoustic modeling with explicit duration models. Experimental results indicate that the proposed model outperformed the baselines in subjective naturalness.
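The exposure bias mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's model: it assumes a hypothetical scalar autoregressive process x_t = a * x_{t-1} and a model whose coefficient is slightly misestimated. All names (`simulate`, `teacher_forced_errors`, `free_running_errors`) are made up for illustration. Under teacher forcing each prediction is conditioned on the true previous value, so the error stays small; in free-running mode the model consumes its own outputs and the error compounds.

```python
def simulate(a, x0, steps):
    """Generate the sequence x_t = a * x_{t-1}, starting from x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(a * xs[-1])
    return xs


def teacher_forced_errors(a_model, truth):
    """One-step errors when each step is conditioned on the TRUE previous value."""
    return [abs(a_model * truth[t - 1] - truth[t]) for t in range(1, len(truth))]


def free_running_errors(a_model, truth):
    """Errors when each step is conditioned on the MODEL'S OWN previous output."""
    preds = simulate(a_model, truth[0], len(truth) - 1)
    return [abs(p - x) for p, x in zip(preds[1:], truth[1:])]


# Assumed toy setting: the true coefficient is 1.0, the model believes 1.05.
truth = simulate(1.0, 1.0, 20)
tf = teacher_forced_errors(1.05, truth)
fr = free_running_errors(1.05, truth)

print("largest teacher-forced error:", max(tf))
print("final free-running error:   ", fr[-1])
```

In this toy setting the two modes agree on the first step, after which the free-running error grows with sequence length while the teacher-forced error stays constant. This is only an intuition for the training/inference mismatch the abstract describes, not a reproduction of the proposed method.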


Pages: 7462-7466

Number of pages: 5

