T-DVAE: A Transformer-Based Dynamical Variational Autoencoder for Speech

Cited by: 0
Authors
Perschewski, Jan-Ole [1]
Stober, Sebastian [1]
Affiliations
[1] Otto von Guericke Univ, AILab, Magdeburg, Germany
Source
ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2024, PT VII | 2024 / Vol. 15022
Keywords
Dynamical VAE; Transformer; Speech; ALGORITHM
DOI
10.1007/978-3-031-72350-6_3
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
In contrast to Variational Autoencoders, Dynamical Variational Autoencoders (DVAEs) learn a sequence of latent states for a time series. They were initially implemented with recurrent neural networks (RNNs), which are known for challenging training dynamics and problems with long-term dependencies. This led to the recent adoption of Transformer-based implementations that stay close to the original RNN-based designs. These implementations still use RNNs as part of the architecture, even though the Transformer can solve the task as the sole building block. Hence, we improve the LigHT-DVAE architecture by removing its dependence on RNNs and cross-attention. Furthermore, we show that a trained LigHT-DVAE ignores its output-to-hidden connections, which allows us to simplify the overall architecture by removing them. We demonstrate the capability of the resulting T-DVAE on LibriSpeech and VoiceBank, with improvements in training time, memory consumption, and generative performance.
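The abstract's central simplification can be illustrated with a toy model: once output-to-hidden connections are dropped, each observation x_t is generated from the latents z_{1:t} alone, which a causal self-attention decoder realizes directly. The following NumPy sketch is illustrative only; the dimensions, weight names, and single-head decoder are assumptions, not the authors' actual T-DVAE architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(z, Wq, Wk, Wv):
    """Single-head self-attention with a causal mask, so position t
    attends only to z_1..z_t -- no future latents, and no feedback
    from past outputs x (i.e., no output-to-hidden connections)."""
    T, d = z.shape
    q, k, v = z @ Wq, z @ Wk, z @ Wv
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # future positions
    scores[mask] = -np.inf
    return softmax(scores, axis=-1) @ v

d_z, d_x, T = 4, 3, 5
Wq, Wk, Wv = (rng.standard_normal((d_z, d_z)) for _ in range(3))
W_out = rng.standard_normal((d_z, d_x))

z = rng.standard_normal((T, d_z))          # latent sequence z_{1:T}
h = causal_self_attention(z, Wq, Wk, Wv)   # h_t summarizes z_{1:t}
x_mean = h @ W_out                         # decoder mean of p(x_t | z_{1:t})

# Causality check: perturbing a future latent leaves earlier outputs unchanged.
z2 = z.copy()
z2[-1] += 10.0
x2 = causal_self_attention(z2, Wq, Wk, Wv) @ W_out
assert np.allclose(x_mean[:-1], x2[:-1])
```

The final assertion makes the structural point concrete: because the mask blocks attention to future positions and nothing feeds x back into the decoder, x_t depends only on z_{1:t}.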
Pages: 33-46
Page count: 14