HUBERT: HOW MUCH CAN A BAD TEACHER BENEFIT ASR PRE-TRAINING?

Cited by: 69
Authors
Hsu, Wei-Ning [1 ]
Tsai, Yao-Hung Hubert [2 ]
Bolte, Benjamin [1 ]
Salakhutdinov, Ruslan [2 ]
Mohamed, Abdelrahman [1 ]
Affiliations
[1] Facebook AI Res, Menlo Pk, CA 94025 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
representation learning; pre-training;
DOI
10.1109/ICASSP39728.2021.9414460
CLC number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Compared to vision and language applications, self-supervised pre-training approaches for ASR are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) with audio-only pre-training, there is no lexicon of sound units, and (3) sound units have variable lengths with no explicit segmentation. In this paper, we propose the Hidden-Unit BERT (HUBERT) model, which utilizes a cheap k-means clustering step to provide aligned target labels for pre-training of a BERT model. A key ingredient of our approach is applying the predictive loss over the masked regions only. This allows the pre-training stage to benefit from the consistency of the unsupervised teacher rather than its intrinsic quality. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HUBERT model matches the state-of-the-art wav2vec 2.0 performance on the ultra-low-resource Libri-light 10h, 1h, and 10min supervised subsets.
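To make the recipe concrete, the sketch below illustrates the two ingredients named in the abstract: an offline k-means teacher that assigns a frame-aligned cluster ID to every acoustic frame, and a BERT-style predictive loss computed over the masked frames only. The feature matrix, the random logits standing in for the Transformer encoder, and all dimensions are placeholder assumptions for illustration, not the authors' implementation.

# Minimal sketch of HuBERT-style pseudo-label targets and masked prediction loss.
# Placeholder features and logits; not the authors' code.
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

n_frames, feat_dim, n_clusters = 1000, 39, 100

# Offline "bad teacher": k-means over frame-level acoustic features (e.g. MFCCs)
# yields a frame-aligned pseudo-label for every frame of the utterance.
features = np.random.randn(n_frames, feat_dim).astype(np.float32)   # placeholder features
teacher = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
pseudo_labels = torch.from_numpy(teacher.fit_predict(features)).long()

# Mask a subset of frames; the encoder would see the corrupted input.
mask = torch.zeros(n_frames, dtype=torch.bool)
mask[torch.randperm(n_frames)[:80]] = True

# Placeholder per-frame logits over the k-means codebook (in HuBERT these would
# come from the Transformer encoder applied to the masked utterance).
logits = torch.randn(n_frames, n_clusters, requires_grad=True)

# Key ingredient: the predictive loss is applied over the masked regions only,
# so pre-training relies on the teacher's consistency rather than its quality.
loss = F.cross_entropy(logits[mask], pseudo_labels[mask])
loss.backward()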
Pages: 6533-6537
Number of pages: 5