HUBERT: HOW MUCH CAN A BAD TEACHER BENEFIT ASR PRE-TRAINING?

Cited by: 84
Authors
Hsu, Wei-Ning [1 ]
Tsai, Yao-Hung Hubert [2 ]
Bolte, Benjamin [1 ]
Salakhutdinov, Ruslan [2 ]
Mohamed, Abdelrahman [1 ]
Affiliations
[1] Facebook AI Research, Menlo Park, CA 94025 USA
[2] Carnegie Mellon University, Pittsburgh, PA 15213 USA
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
representation learning; pre-training;
DOI
10.1109/ICASSP39728.2021.9414460
Chinese Library Classification (CLC) Number
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Compared to vision and language applications, self-supervised pre-training approaches for ASR are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) with audio-only pre-training, there is no lexicon of sound units, and (3) sound units have variable lengths with no explicit segmentation. In this paper, we propose the Hidden-Unit BERT (HUBERT) model, which utilizes a cheap k-means clustering step to provide aligned target labels for pre-training of a BERT model. A key ingredient of our approach is applying the predictive loss over the masked regions only. This allows the pre-training stage to benefit from the consistency of the unsupervised teacher rather than its intrinsic quality. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HUBERT model matches the state-of-the-art wav2vec 2.0 performance on the ultra low-resource Libri-light 10h, 1h, and 10min supervised subsets.
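The recipe described in the abstract (a k-means teacher assigning each frame a cluster ID, then a BERT-style model trained with a predictive loss over masked frames only) can be illustrated with a short sketch. This is not the authors' implementation: the toy frame-wise linear encoder standing in for BERT, the random MFCC-like features, and the single masked span are all illustrative assumptions.

# Minimal sketch of HUBERT-style pre-training targets and loss.
# Assumptions: a toy linear encoder stands in for the BERT encoder,
# random features stand in for MFCCs, one contiguous span stands in
# for span masking.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Step 1: the "cheap" k-means teacher assigns each frame a cluster ID.
num_clusters, feat_dim = 100, 39            # 100 clusters, as in the paper
corpus_feats = torch.randn(5000, feat_dim)  # stand-in for corpus MFCC frames
kmeans = KMeans(n_clusters=num_clusters, n_init=10).fit(corpus_feats.numpy())

# Step 2: masked prediction of the teacher's cluster IDs.
x = torch.randn(1, 200, feat_dim)           # one utterance: (batch, frames, dim)
y = torch.tensor(kmeans.predict(x[0].numpy()), dtype=torch.long).unsqueeze(0)

encoder = nn.Sequential(                    # toy stand-in for the BERT encoder
    nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, num_clusters))

mask = torch.zeros(1, 200, dtype=torch.bool)
mask[0, 50:70] = True                       # mask one span of frames
x_masked = x.clone()
x_masked[mask] = 0.0                        # stand-in for a learned [MASK] embedding

# Key ingredient: the predictive loss is applied over masked frames only
# (a real BERT encoder would use the unmasked context to predict them).
logits = encoder(x_masked)                  # (batch, frames, num_clusters)
loss = nn.functional.cross_entropy(logits[mask], y[mask])
loss.backward()

Because the loss is restricted to masked positions, the pre-training objective rewards a teacher whose labels are consistent across similar frames, even if the labels themselves are noisy, which is the paper's central argument for why a "bad" k-means teacher suffices.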
Pages: 6533-6537
Page count: 5