Regularizing transformers with deep probabilistic layers

Cited by: 6
Authors
Cobo Aguilera, Aurora [1]
Olmos, Pablo M. [1]
Artes-Rodriguez, Antonio [1]
Perez-Cruz, Fernando [2]
Affiliations
[1] Univ Carlos III Madrid, Dept Signal Theory & Commun, Avda Univ 30, Madrid 28911, Spain
[2] Swiss Data Sci Inst ETHZ EPFL, Univ Str 25, CH-8006 Zurich, Switzerland
Funding
European Research Council;
Keywords
Natural language processing; Regularization; Deep learning; Transformers; Variational auto-encoder; Missing data;
DOI
10.1016/j.neunet.2023.01.032
CLC number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Language models (LM) have grown non-stop in the last decade, from sequence-to-sequence architectures to attention-based Transformers. However, regularization is not deeply studied in those structures. In this work, we use a Gaussian Mixture Variational Autoencoder (GMVAE) as a regularizer layer. We study its advantages with respect to the depth at which it is placed and prove its effectiveness in several scenarios. Experimental results demonstrate that including deep generative models within Transformer-based architectures such as BERT, RoBERTa, or XLM-R yields more versatile models that generalize better, achieve improved imputation scores in tasks such as SST-2 and TREC, and even impute missing/noisy words with richer text. (c) 2023 Elsevier Ltd. All rights reserved.
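The paper's implementation is not reproduced here, but the idea the abstract describes (penalizing the hidden states of a chosen Transformer layer with a VAE-style loss under a Gaussian-mixture prior) can be sketched in plain NumPy. Everything below is an illustrative assumption rather than the authors' code: the function names, the uniform mixture weights, the Monte-Carlo estimate of the prior term, and the toy identity "decoder" that reconstructs the hidden states directly from the latent sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gaussian(z, mu, var):
    # log N(z; mu, diag(var)), summed over the last (latent) dimension
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (z - mu) ** 2 / var, axis=-1)

def gmvae_regularizer(h, enc_mu, enc_logvar, prior_mus, prior_vars, n_samples=8):
    # Monte-Carlo estimate of a negative ELBO on hidden states h:
    # a reconstruction term plus the gap log q(z|h) - log p(z), where p(z)
    # is a uniform-weight Gaussian mixture. Added to the task loss, this
    # acts as a regularizer on the chosen layer's representations.
    var = np.exp(enc_logvar)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(enc_mu.shape)
        z = enc_mu + np.sqrt(var) * eps                  # reparameterization trick
        # mixture prior log-density: log (1/K) * sum_k N(z; mu_k, var_k)
        comp = np.stack([log_gaussian(z, m, v)
                         for m, v in zip(prior_mus, prior_vars)])
        log_prior = np.log(np.mean(np.exp(comp), axis=0) + 1e-12)
        log_q = log_gaussian(z, enc_mu, var)
        recon = -log_gaussian(h, z, np.ones_like(h))     # toy identity "decoder"
        total += np.mean(recon + log_q - log_prior)
    return total / n_samples
```

In a real model the encoder statistics `enc_mu`, `enc_logvar` would come from a small network reading the Transformer layer's output, the decoder would be learned, and the whole term would be weighted and added to the downstream objective; the sketch only shows the shape of the loss.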
Pages: 565-574
Page count: 10