AUTOREGRESSIVE VARIATIONAL AUTOENCODER WITH A HIDDEN SEMI-MARKOV MODEL-BASED STRUCTURED ATTENTION FOR SPEECH SYNTHESIS

Cited by: 2
Authors
Fujimoto, Takato [1]
Hashimoto, Kei [1]
Nankaku, Yoshihiko [1]
Tokuda, Keiichi [1]
Affiliations
[1] Nagoya Inst Technol, Nagoya, Aichi, Japan
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
speech synthesis; variational autoencoder; autoregressive model; attention mechanism; hidden semi-Markov model;
DOI
10.1109/ICASSP43922.2022.9746158
Chinese Library Classification number
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
This paper proposes an autoregressive speech synthesis model based on the variational autoencoder that incorporates a latent sequence representation of acoustic and linguistic features and the structure of a hidden semi-Markov model (HSMM). Although autoregressive models can model acoustic features efficiently and accurately, they suffer from exposure bias, i.e., a mismatch between training (teacher forcing) and inference (free running). To overcome this problem, we introduce an autoregressive latent variable sequence instead of generating the observations autoregressively. A latent representation of the alignment, obtained using an HSMM-based structured attention mechanism, enables a completely consistent training algorithm for acoustic modeling with explicit duration models. Experimental results indicate that the proposed model outperformed the baselines in subjective naturalness.
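The exposure bias mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's model: it assumes a hypothetical scalar process x_t = a_true * x_{t-1} and a slightly misestimated coefficient a_model, and compares the mean absolute error of teacher-forced prediction (conditioned on the true previous value) with that of free-running prediction (conditioned on the model's own previous output). All names and parameter values are illustrative assumptions.

```python
def rollout_errors(a_true=0.9, a_model=0.95, x0=1.0, steps=20):
    """Return (teacher_forced_mae, free_running_mae) for a toy AR(1) process.

    Hypothetical setup for illustration only; not taken from the paper.
    """
    # Ground-truth sequence generated by the true coefficient.
    truth = [x0]
    for _ in range(steps):
        truth.append(a_true * truth[-1])

    tf_err = 0.0   # accumulated teacher-forced error
    fr_err = 0.0   # accumulated free-running error
    fr_prev = x0   # the model's own previous output in free-running mode

    for t in range(1, steps + 1):
        # Teacher forcing: condition on the true previous value.
        tf_pred = a_model * truth[t - 1]
        # Free running: condition on the model's previous prediction.
        fr_pred = a_model * fr_prev

        tf_err += abs(tf_pred - truth[t])
        fr_err += abs(fr_pred - truth[t])
        fr_prev = fr_pred

    return tf_err / steps, fr_err / steps


if __name__ == "__main__":
    tf_mae, fr_mae = rollout_errors()
    print(f"teacher-forced MAE: {tf_mae:.4f}")
    print(f"free-running MAE:   {fr_mae:.4f}")
```

In this toy setting the one-step error under teacher forcing stays small because it is reset by the true value at every step, while the free-running error compounds over time. That gap is the training/inference mismatch the abstract refers to.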
Pages: 7462-7466
Number of pages: 5