HUBERT: HOW MUCH CAN A BAD TEACHER BENEFIT ASR PRE-TRAINING?

Cited by: 69
Authors
Hsu, Wei-Ning [1 ]
Tsai, Yao-Hung Hubert [2 ]
Bolte, Benjamin [1 ]
Salakhutdinov, Ruslan [2 ]
Mohamed, Abdelrahman [1 ]
Affiliations
[1] Facebook AI Res, Menlo Pk, CA 94025 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
representation learning; pre-training;
DOI
10.1109/ICASSP39728.2021.9414460
CLC number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Compared to vision and language applications, self-supervised pre-training approaches for ASR are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) with audio-only pre-training, there is no lexicon of sound units, and (3) sound units have variable lengths with no explicit segmentation. In this paper, we propose the Hidden-Unit BERT (HUBERT) model, which utilizes a cheap k-means clustering step to provide aligned target labels for pre-training of a BERT model. A key ingredient of our approach is applying the predictive loss over the masked regions only. This allows the pre-training stage to benefit from the consistency of the unsupervised teacher rather than its intrinsic quality. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HUBERT model matches the state-of-the-art wav2vec 2.0 performance on the ultra-low-resource Libri-light 10h, 1h, and 10min supervised subsets.
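To make the recipe concrete, the sketch below illustrates the two ingredients named in the abstract: an offline k-means teacher that assigns a frame-aligned cluster ID to every acoustic frame, and a BERT-style predictive loss computed over the masked frames only. The feature matrix, the random logits standing in for the Transformer encoder, and all dimensions are placeholder assumptions for illustration, not the authors' implementation.

# Minimal sketch of HuBERT-style pseudo-label targets and masked prediction loss.
# Placeholder features and logits; not the authors' code.
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

n_frames, feat_dim, n_clusters = 1000, 39, 100

# Offline "bad teacher": k-means over frame-level acoustic features (e.g. MFCCs)
# yields a frame-aligned pseudo-label for every frame of the utterance.
features = np.random.randn(n_frames, feat_dim).astype(np.float32)   # placeholder features
teacher = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
pseudo_labels = torch.from_numpy(teacher.fit_predict(features)).long()

# Mask a subset of frames; the encoder would see the corrupted input.
mask = torch.zeros(n_frames, dtype=torch.bool)
mask[torch.randperm(n_frames)[:80]] = True

# Placeholder per-frame logits over the k-means codebook (in HuBERT these would
# come from the Transformer encoder applied to the masked utterance).
logits = torch.randn(n_frames, n_clusters, requires_grad=True)

# Key ingredient: the predictive loss is applied over the masked regions only,
# so pre-training relies on the teacher's consistency rather than its quality.
loss = F.cross_entropy(logits[mask], pseudo_labels[mask])
loss.backward()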
Pages: 6533-6537
Number of pages: 5