W2V-BERT: COMBINING CONTRASTIVE LEARNING AND MASKED LANGUAGE MODELING FOR SELF-SUPERVISED SPEECH PRE-TRAINING

Cited by: 172
Authors
Chung, Yu-An [1 ,2 ]
Zhang, Yu [2 ]
Han, Wei [2 ]
Chiu, Chung-Cheng [2 ]
Qin, James [2 ]
Pang, Ruoming [2 ]
Wu, Yonghui [2 ]
Affiliations
[1] MIT, Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139 USA
[2] Google Brain, Mountain View, CA USA
Source
2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU) | 2021
Keywords
Self-supervised learning; representation learning; unsupervised pre-training; BERT; wav2vec 2.0
DOI
10.1109/ASRU51503.2021.9688253
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Motivated by the success of masked language modeling (MLM) in pre-training natural language processing models, we propose w2v-BERT, which explores MLM for self-supervised speech representation learning. w2v-BERT is a framework that combines contrastive learning and MLM, where the former trains the model to discretize continuous input speech signals into a finite set of discriminative speech tokens, and the latter trains the model to learn contextualized speech representations by solving a masked prediction task over the discretized tokens. In contrast to existing MLM-based speech pre-training frameworks such as HuBERT, which relies on an iterative re-clustering and re-training process, or vq-wav2vec, which concatenates two separately trained modules, w2v-BERT can be optimized end-to-end by solving the two self-supervised tasks (the contrastive task and MLM) simultaneously. Our experiments show that w2v-BERT achieves competitive results compared to current state-of-the-art pre-trained models on the LibriSpeech benchmarks when using the Libri-Light 60k corpus as the unsupervised data. In particular, compared to published models such as conformer-based wav2vec 2.0 and HuBERT, our model shows a 5% to 10% relative WER reduction on the test-clean and test-other subsets. When applied to Google's Voice Search traffic dataset, w2v-BERT outperforms our internal conformer-based wav2vec 2.0 by more than 30% relative.
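As a rough illustration of the joint objective described in the abstract (not taken from the paper or its released code), the following minimal PyTorch-style sketch assumes both sub-losses reduce to cross-entropy terms over quantized token IDs at masked positions; the function name w2v_bert_loss, the diversity regularizer, and the weights alpha, beta, and gamma are hypothetical placeholders, not the paper's actual hyperparameters.

import torch
import torch.nn.functional as F

def w2v_bert_loss(contrastive_logits, contrastive_targets,
                  mlm_logits, mlm_targets,
                  diversity_loss, alpha=0.1, beta=1.0, gamma=1.0):
    # Contrastive sub-task: identify the true quantized token among distractors
    # (wav2vec 2.0-style), plus an assumed codebook-diversity regularizer.
    l_contrastive = F.cross_entropy(contrastive_logits, contrastive_targets)
    l_c = l_contrastive + alpha * diversity_loss
    # MLM sub-task: predict the discrete token ID at each masked position.
    l_mlm = F.cross_entropy(mlm_logits, mlm_targets)
    # Both sub-losses are summed into a single objective and optimized together.
    return beta * l_c + gamma * l_mlm

# Toy usage with random tensors standing in for model outputs at masked positions.
num_masked, codebook_size = 8, 1024
loss = w2v_bert_loss(
    torch.randn(num_masked, codebook_size, requires_grad=True),
    torch.randint(0, codebook_size, (num_masked,)),
    torch.randn(num_masked, codebook_size, requires_grad=True),
    torch.randint(0, codebook_size, (num_masked,)),
    diversity_loss=torch.tensor(0.05),
)
loss.backward()

The point of the sketch is that the contrastive and masked-prediction terms feed one combined loss and are back-propagated together, which is what allows end-to-end training without HuBERT's iterative re-clustering or vq-wav2vec's two separately trained modules.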
Pages: 244-250
Number of pages: 7
References (40 in total)
[1] [Anonymous], 2012, Sequence transduction with recurrent neural networks.
[2] [Anonymous], 1998, Proceedings of the DARPA Broadcast News Workshop.
[3] Baevski A., 2020, International Conference on Learning Representations (ICLR).
[4] Baevski A., 2020, Advances in Neural Information Processing Systems, Vol. 33.
[5] Baevski A., 2019, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), p. 5360.
[6] Bai Jiawang, 2021, International Conference on Learning Representations (ICLR).
[7] Chung Y.-A., Hsu W.-N., Tang H., Glass J., 2019, An Unsupervised Autoregressive Model for Speech Representation Learning, INTERSPEECH 2019, pp. 146-150.
[8] Chung Y.-A., 2020, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), p. 2353.
[9] Chung Y.-A., 2020, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), p. 3497, DOI 10.1109/ICASSP40776.2020.9054438.
[10] Devlin J., 2019, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Vol. 1, p. 4171.