DEEP CONTEXTUALIZED ACOUSTIC REPRESENTATIONS FOR SEMI-SUPERVISED SPEECH RECOGNITION

Cited by: 0

Authors:

Ling, Shaoshi ^{[1]}

Liu, Yuzong ^{[1]}

Salazar, Julian ^{[1]}

Kirchhoff, Katrin ^{[1]}

Affiliations:

[1] Amazon AWS AI, Seattle, WA 98109 USA

Source:

2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020

Keywords:

speech recognition; acoustic representation learning; semi-supervised learning; FRAMEWORK

DOI:

10.1109/icassp40776.2020.9053176

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

We propose a novel approach to semi-supervised automatic speech recognition (ASR). We first exploit a large amount of unlabeled audio data via representation learning, where we reconstruct a temporal slice of filterbank features from past and future context frames. The resulting deep contextualized acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end ASR system using a smaller amount of labeled audio data. In our experiments, we show that systems trained on DeCoAR consistently outperform ones trained on conventional filterbank features, giving 42% and 19% relative improvement over the baseline on WSJ eval92 and LibriSpeech test-clean, respectively. Our approach can drastically reduce the amount of labeled data required; unsupervised training on LibriSpeech then supervision with 100 hours of labeled data achieves performance on par with training on all 960 hours directly.
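The reconstruction objective described in the abstract can be illustrated with a minimal sketch. It only shows how an utterance might be split into past context, a target slice, and future context, and how a reconstruction error could be scored. The function names, the slice length, the L1 loss, and the placeholder predictor are all assumptions made for illustration; the abstract does not specify them, and a real system would replace the placeholder with a learned network.

```python
import numpy as np


def split_context_and_target(features, start, slice_len):
    """Split a (T, D) filterbank matrix into past context, target slice, future context.

    Hypothetical helper: `start` and `slice_len` are illustrative parameters.
    """
    num_frames = features.shape[0]
    if slice_len <= 0 or start < 0 or start + slice_len > num_frames:
        raise ValueError("slice must lie inside the utterance")
    past = features[:start]                      # frames before the slice
    target = features[start:start + slice_len]   # frames to be reconstructed
    future = features[start + slice_len:]        # frames after the slice
    return past, target, future


def l1_reconstruction_loss(predicted, target):
    """Mean absolute error between predicted and true frames (assumed loss)."""
    return float(np.abs(predicted - target).mean())


# Toy utterance: 100 frames of 40-dimensional features (random stand-in data).
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 40))

past, target, future = split_context_and_target(feats, start=50, slice_len=8)

# Placeholder predictor (assumption): average the two frames bordering the slice
# and repeat that vector for every frame in the slice. A learned model that reads
# the full past and future context would take its place.
boundary_mean = (past[-1] + future[0]) / 2.0
prediction = np.tile(boundary_mean, (target.shape[0], 1))

loss = l1_reconstruction_loss(prediction, target)
print(past.shape, target.shape, future.shape, loss)
```

In this sketch the target frames are never shown to the predictor, so any model substituted for the placeholder has to infer the slice from its surroundings, which is the property the abstract attributes to the representation learning step.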


Pages: 6429-6433

Page count: 5

