LegoNN: Building Modular Encoder-Decoder Models

Cited by: 3
Authors
Dalmia, Siddharth [1]
Okhonko, Dmytro [4]
Lewis, Mike [2]
Edunov, Sergey [2]
Watanabe, Shinji [1]
Metze, Florian [1,2]
Zettlemoyer, Luke [2]
Mohamed, Abdelrahman [3]
Affiliations
[1] Carnegie Mellon University, Pittsburgh, PA 15213 USA
[2] Meta Platforms Inc., Menlo Park, CA 94025 USA
[3] Rembrand Inc., Palo Alto, CA 94062 USA
[4] Samaya AI, Mountain View, CA 94040 USA
Keywords
End-to-end; encoder-decoder models; modularity; speech recognition; machine translation; Transformer
DOI
10.1109/TASLP.2023.3296019
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
State-of-the-art encoder-decoder models (e.g., for machine translation (MT) or automatic speech recognition (ASR)) are constructed and trained end-to-end as an atomic unit. No component of the model can be (re-)used without the others, making it impossible to share parts, such as a high-resourced decoder, across tasks. We describe LegoNN, a procedure for building encoder-decoder architectures so that their parts can be applied to other tasks without any fine-tuning. To achieve this reusability, the interface between encoder and decoder modules is grounded to a sequence of marginal distributions over a pre-defined discrete vocabulary. We present two approaches for ingesting these marginals: one is differentiable, allowing the flow of gradients across the entire network, and the other is gradient-isolating. To enable the portability of decoder modules between MT tasks with different source languages and across other tasks like ASR, we introduce a modality-agnostic encoder with a length-control mechanism that dynamically adapts the encoder's output length to match the expected input length range of pre-trained decoders. We present several experiments demonstrating the effectiveness of LegoNN models: a trained language-generation LegoNN decoder module from a German-English (De-En) MT task can be reused without any fine-tuning for the Europarl English ASR and Romanian-English (Ro-En) MT tasks, matching or beating the performance of the baselines. After fine-tuning, LegoNN models improve the Ro-En MT task by 1.5 BLEU points and achieve a 12.5% relative WER reduction on the Europarl ASR task. To show how the approach generalizes, we compose a LegoNN ASR model from three modules, each learned within a different end-to-end trained model on a different dataset, achieving an overall WER reduction of 19.5%.
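A minimal PyTorch sketch of the interface idea described in the abstract: the encoder's states are grounded to marginal distributions over a shared discrete vocabulary, then re-embedded as decoder inputs, with the two ingestion modes (differentiable vs. gradient-isolating) toggled by a flag. Class and variable names such as `MarginalInterface` are illustrative assumptions, not the authors' implementation, and the length-control mechanism is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarginalInterface(nn.Module):
    """Hypothetical sketch: grounds encoder states to marginal distributions
    over a pre-defined vocabulary, then re-embeds them for a reusable decoder."""

    def __init__(self, d_model: int, vocab_size: int, gradient_isolating: bool = False):
        super().__init__()
        self.to_vocab = nn.Linear(d_model, vocab_size)           # states -> vocab logits
        self.embed = nn.Linear(vocab_size, d_model, bias=False)  # marginals -> decoder input
        self.gradient_isolating = gradient_isolating

    def forward(self, enc_states: torch.Tensor) -> torch.Tensor:
        # Marginal distribution over the vocabulary at each encoder position.
        marginals = F.softmax(self.to_vocab(enc_states), dim=-1)
        if self.gradient_isolating:
            # Gradient-isolating mode: block gradients at the interface so
            # encoder and decoder can be trained and swapped independently.
            marginals = marginals.detach()
        # Differentiable ingestion: expected embedding under the marginals.
        return self.embed(marginals)

# Usage with assumed sizes: a (batch, time, d_model) encoder output.
iface = MarginalInterface(d_model=512, vocab_size=10000, gradient_isolating=True)
enc_states = torch.randn(2, 50, 512)
dec_inputs = iface(enc_states)  # (2, 50, 512), ready for a pre-trained decoder
```

Because the exchanged representation is a distribution over a fixed vocabulary rather than an arbitrary hidden state, any decoder trained against that vocabulary can, in principle, consume the output of a different encoder, which is what enables the zero-fine-tuning reuse the abstract reports.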
Pages: 3112-3126 (15 pages)