AUTOREGRESSIVE VARIATIONAL AUTOENCODER WITH A HIDDEN SEMI-MARKOV MODEL-BASED STRUCTURED ATTENTION FOR SPEECH SYNTHESIS

Cited by: 2
Authors
Fujimoto, Takato [1]
Hashimoto, Kei [1]
Nankaku, Yoshihiko [1]
Tokuda, Keiichi [1]
Affiliation
[1] Nagoya Inst Technol, Nagoya, Aichi, Japan
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
speech synthesis; variational autoencoder; autoregressive model; attention mechanism; hidden semi-Markov model
DOI
10.1109/ICASSP43922.2022.9746158
Chinese Library Classification
O42 [Acoustics]
Subject Classification
070206; 082403
Abstract
This paper proposes an autoregressive speech synthesis model based on a variational autoencoder that incorporates a latent sequence representation of acoustic and linguistic features together with the structure of a hidden semi-Markov model (HSMM). Although autoregressive models provide efficient and accurate modeling of acoustic features, they suffer from exposure bias, i.e., the mismatch between training (teacher-forcing) and inference (free-running). To overcome this problem, we introduce an autoregressive latent variable sequence rather than autoregressive generation of observations. A latent representation of alignment using an HSMM-based structured attention mechanism enables a fully consistent training algorithm for acoustic modeling with explicit duration models. Experimental results indicate that the proposed model outperformed baselines in subjective naturalness.
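The abstract's key idea is that the autoregression runs over latent variables rather than over the generated observations, so the same latent recursion is used at training and inference time and the teacher-forcing/free-running mismatch over outputs disappears. The following is a minimal sketch of that idea only, not the paper's actual model: the transition function, dimensions, and noise scale here are hypothetical placeholders, and the real system additionally conditions on linguistic features and an HSMM-based alignment.

```python
import numpy as np

rng = np.random.default_rng(0)

def ar_latent_rollout(T, dim, W, b):
    """Roll out an autoregressive latent sequence z_t = tanh(W z_{t-1} + b) + noise.

    The recursion over z is identical whether the model is being trained or
    used for synthesis, which is the sketched analogue of avoiding exposure
    bias: no ground-truth observation is ever fed back into the loop.
    """
    z = np.zeros(dim)
    zs = []
    for _ in range(T):
        # Hypothetical transition: deterministic map plus Gaussian innovation.
        z = np.tanh(W @ z + b) + 0.1 * rng.standard_normal(dim)
        zs.append(z)
    return np.stack(zs)  # shape (T, dim)

dim = 4
W = 0.5 * rng.standard_normal((dim, dim))
b = rng.standard_normal(dim)
Z = ar_latent_rollout(10, dim, W, b)
print(Z.shape)  # (10, 4)
```

In the paper's setting a decoder would then map each z_t to an acoustic frame; since the decoder's inputs come from the latent rollout in both phases, training and inference see the same input distribution over latents.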
Pages: 7462-7466 (5 pages)