EFFICIENT ADAPTER TRANSFER OF SELF-SUPERVISED SPEECH MODELS FOR AUTOMATIC SPEECH RECOGNITION

Cited by: 47
Authors
Thomas, Bethan [1 ]
Kessler, Samuel [1 ,2 ]
Karout, Salah [1 ]
Affiliations
[1] Huawei R&D UK, Cambridge, England
[2] Univ Oxford, Oxford, England
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
Automatic Speech Recognition; Self-Supervision; Adapters; Transfer Learning
DOI
10.1109/ICASSP43922.2022.9746223
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Self-supervised learning (SSL) is a powerful tool that allows learning of underlying representations from unlabeled data. Transformer-based models such as wav2vec 2.0 and HuBERT are leading the field in the speech domain. Generally, these models are fine-tuned on a small amount of labeled data for a downstream task such as Automatic Speech Recognition (ASR), which involves re-training the majority of the model for each task. Adapters are small, lightweight modules commonly used in Natural Language Processing (NLP) to adapt pre-trained models to new tasks. In this paper, we propose applying adapters to wav2vec 2.0 to reduce the number of parameters required for downstream ASR tasks and to increase the scalability of the model to multiple tasks or languages. Using adapters, we can perform ASR while training fewer than 10% of parameters per task compared to full fine-tuning, with little degradation of performance. Ablations show that applying adapters to just the top few layers of the pre-trained network gives performance similar to full transfer, supporting the theory that higher pre-trained layers encode more phonemic information, and further optimizing efficiency.
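The abstract describes inserting small adapter modules into a pre-trained wav2vec 2.0 encoder and training only those modules (per task) while the rest of the model stays frozen. The sketch below is a minimal PyTorch illustration of a generic bottleneck adapter of that kind (down-projection, non-linearity, up-projection, residual connection); it is an assumed, illustrative design rather than the authors' exact implementation, and the 768-dimensional hidden size and bottleneck width of 64 are placeholder values.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: a small trainable residual block added to a frozen layer."""
    def __init__(self, dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)  # project to a narrow bottleneck
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, dim)    # project back to the model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection leaves the frozen layer's output unchanged at initialization.
        return x + self.up(self.act(self.down(x)))

# Example: adapt the output of one frozen transformer block (hypothetical sizes).
hidden = torch.randn(2, 50, 768)   # (batch, frames, hidden dim)
adapter = Adapter(dim=768)         # roughly 0.1M trainable parameters per adapter
out = adapter(hidden)

Training only such adapters (plus, typically, a small task head), either in every layer or only in the top few layers as in the paper's ablation, is what keeps the per-task trainable parameter count below 10% of the full model.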
Pages: 7102-7106
Number of pages: 5