T-DVAE: A Transformer-Based Dynamical Variational Autoencoder for Speech

Cited by: 0
Authors
Perschewski, Jan-Ole [1]
Stober, Sebastian [1]
Affiliations
[1] Otto von Guericke Univ, AILab, Magdeburg, Germany
Source
ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2024, PT VII | 2024 / Vol. 15022
Keywords
Dynamical VAE; Transformer; Speech; ALGORITHM
DOI
10.1007/978-3-031-72350-6_3
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
In contrast to Variational Autoencoders, Dynamical Variational Autoencoders (DVAEs) learn a sequence of latent states for a time series. They were initially implemented with recurrent neural networks (RNNs), which are known for challenging training dynamics and problems with long-term dependencies. This led to the recent adoption of Transformer-based implementations that stay close to the original RNN-based designs. These implementations still use RNNs as part of the architecture, even though the Transformer can solve the task as the sole building block. Hence, we improve the LigHT-DVAE architecture by removing its dependence on RNNs and cross-attention. Furthermore, we show that a trained LigHT-DVAE ignores its output-to-hidden connections, which allows us to simplify the overall architecture by removing them. We demonstrate the capability of the resulting T-DVAE on LibriSpeech and VoiceBank, with improvements in training time, memory consumption, and generative performance.
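The abstract's central simplification can be illustrated with a toy model: once output-to-hidden connections are dropped, each observation x_t is generated from the latents z_{1:t} alone, which a causal self-attention decoder realizes directly. The following NumPy sketch is illustrative only; the dimensions, weight names, and single-head decoder are assumptions, not the authors' actual T-DVAE architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(z, Wq, Wk, Wv):
    """Single-head self-attention with a causal mask, so position t
    attends only to z_1..z_t -- no future latents, and no feedback
    from past outputs x (i.e., no output-to-hidden connections)."""
    T, d = z.shape
    q, k, v = z @ Wq, z @ Wk, z @ Wv
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # future positions
    scores[mask] = -np.inf
    return softmax(scores, axis=-1) @ v

d_z, d_x, T = 4, 3, 5
Wq, Wk, Wv = (rng.standard_normal((d_z, d_z)) for _ in range(3))
W_out = rng.standard_normal((d_z, d_x))

z = rng.standard_normal((T, d_z))          # latent sequence z_{1:T}
h = causal_self_attention(z, Wq, Wk, Wv)   # h_t summarizes z_{1:t}
x_mean = h @ W_out                         # decoder mean of p(x_t | z_{1:t})

# Causality check: perturbing a future latent leaves earlier outputs unchanged.
z2 = z.copy()
z2[-1] += 10.0
x2 = causal_self_attention(z2, Wq, Wk, Wv) @ W_out
assert np.allclose(x_mean[:-1], x2[:-1])
```

The final assertion makes the structural point concrete: because the mask blocks attention to future positions and nothing feeds x back into the decoder, x_t depends only on z_{1:t}.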
Pages: 33-46
Page count: 14