DEEP CONTEXTUALIZED ACOUSTIC REPRESENTATIONS FOR SEMI-SUPERVISED SPEECH RECOGNITION

Cited by: 0
Authors
Ling, Shaoshi [1]
Liu, Yuzong [1]
Salazar, Julian [1]
Kirchhoff, Katrin [1]
Affiliations
[1] Amazon AWS AI, Seattle, WA 98109 USA
Source
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020
Keywords
speech recognition; acoustic representation learning; semi-supervised learning; FRAMEWORK;
DOI
10.1109/icassp40776.2020.9053176
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
We propose a novel approach to semi-supervised automatic speech recognition (ASR). We first exploit a large amount of unlabeled audio data via representation learning, where we reconstruct a temporal slice of filterbank features from past and future context frames. The resulting deep contextualized acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end ASR system using a smaller amount of labeled audio data. In our experiments, we show that systems trained on DeCoAR consistently outperform ones trained on conventional filterbank features, giving 42% and 19% relative improvement over the baseline on WSJ eval92 and LibriSpeech test-clean, respectively. Our approach can drastically reduce the amount of labeled data required; unsupervised training on LibriSpeech then supervision with 100 hours of labeled data achieves performance on par with training on all 960 hours directly.
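The slice-reconstruction objective described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names (`split_context`, `interpolate_slice`, `l1_loss`), the use of an L1 distance, and the interpolation "predictor" do not come from the record. In particular, `interpolate_slice` is only a stand-in for the learned network that would actually predict the slice from the past and future context.

```python
from typing import List, Tuple

Frame = List[float]  # one filterbank feature vector (hypothetical representation)


def split_context(
    frames: List[Frame], start: int, slice_len: int
) -> Tuple[List[Frame], List[Frame], List[Frame]]:
    """Split a sequence into (past context, target slice, future context)."""
    if slice_len <= 0 or start < 0 or start + slice_len > len(frames):
        raise ValueError("slice must lie inside the sequence")
    past = frames[:start]
    target = frames[start:start + slice_len]
    future = frames[start + slice_len:]
    return past, target, future


def interpolate_slice(
    past: List[Frame], future: List[Frame], slice_len: int
) -> List[Frame]:
    """Stand-in predictor (NOT a learned model): linearly interpolate
    between the last past frame and the first future frame."""
    if not past or not future:
        raise ValueError("need at least one past and one future frame")
    left, right = past[-1], future[0]
    predicted = []
    for i in range(1, slice_len + 1):
        w = i / (slice_len + 1)
        predicted.append([(1 - w) * a + w * b for a, b in zip(left, right)])
    return predicted


def l1_loss(predicted: List[Frame], target: List[Frame]) -> float:
    """Sum of absolute differences over all frames and feature dimensions
    (the choice of L1 is an assumption for this sketch)."""
    return sum(
        abs(p - t)
        for p_frame, t_frame in zip(predicted, target)
        for p, t in zip(p_frame, t_frame)
    )


# Toy one-dimensional "features" that grow linearly over time.
frames = [[0.0], [1.0], [2.0], [3.0], [4.0]]
past, target, future = split_context(frames, start=1, slice_len=3)
loss = l1_loss(interpolate_slice(past, future, slice_len=3), target)
```

In a real system the interpolation step would be replaced by a trained network that reads the context frames, and the loss would be minimized over many slices of unlabeled audio.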
Pages: 6429-6433
Number of pages: 5
Related papers
29 items in total
  • [1] [Anonymous], 2019, INTERSPEECH, DOI 10.1145/3314493.3314518
  • [2] Baevski A., 2019, arXiv:1910.05453
  • [3] Semi-supervised Sequence-to-sequence ASR using Unpaired Speech and Text
    Baskar, Murali Karthick
    Watanabe, Shinji
    Astudillo, Ramon
    Hori, Takaaki
    Burget, Lukas
    Cernocky, Jan
    [J]. INTERSPEECH 2019, 2019, : 3790 - 3794
  • [4] Bengio Y., 2006, P ADV NEURAL INF PROC, P153, DOI 10.5555/2976456.2976476
  • [5] Unsupervised Speech Representation Learning Using WaveNet Autoencoders
    Chorowski, Jan
    Weiss, Ron J.
    Bengio, Samy
    van den Oord, Aaron
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2019, 27 (12) : 2041 - 2053
  • [6] Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech
    Chung, Yu-An
    Glass, James
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 811 - 815
  • [7] Devlin J, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P4171
  • [8] Exploiting semi-supervised training through a dropout regularization in end-to-end speech recognition
    Dey, Subhadeep
    Motlicek, Petr
    Bui, Trung
    Dernoncourt, Franck
    [J]. INTERSPEECH 2019, 2019, : 734 - 738
  • [9] Grandvalet Yves, 2004, NeurIPS
  • [10] Graves A., 2006, P 23 INT C MACH LEAR, P369, DOI 10.1145/1143844.1143891