Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition

Cited by: 6
Authors
Masumura, Ryo [1]
Makishima, Naoki [1]
Ihori, Mana [1]
Takashima, Akihiko [1]
Tanaka, Tomohiro [1]
Orihashi, Shota [1]
Affiliations
[1] NTT Corp, NTT Media Intelligence Labs, Tokyo, Japan
Source
INTERSPEECH 2020 | 2020
Keywords
end-to-end automatic speech recognition; phoneme-to-grapheme conversion; pre-training; Transformer; data augmentation
DOI
10.21437/Interspeech.2020-1930
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
This paper describes a simple and efficient pre-training method that uses a large amount of external text to enhance end-to-end automatic speech recognition (ASR). Constructing end-to-end ASR models generally requires speech-to-text paired data, but collecting a large amount of such data is difficult in practice. One issue caused by this data scarcity is poor ASR performance on out-of-domain tasks, i.e., tasks whose domain differs from that of the available speech-to-text paired data, since the mapping from speech to text is not well learned for those domains. To address this problem, we leverage a large amount of phoneme-to-grapheme (P2G) paired data, which can easily be created from external texts and a rich pronunciation dictionary. P2G conversion and end-to-end ASR can be regarded as similar transformation tasks, in which input phonetic information is converted into textual information. Our method uses the P2G conversion task to pre-train the decoder network of a Transformer encoder-decoder based end-to-end ASR model. Experiments using 4 billion tokens of Web text demonstrate that our pre-training significantly improves ASR performance on out-of-domain tasks.
Pages: 2822-2826
Page count: 5
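To make the recipe in the abstract concrete, below is a minimal PyTorch sketch of the core idea: derive P2G pairs from plain text with a pronunciation dictionary, then train a Transformer encoder-decoder to map phonemes to graphemes so that its decoder can later be transferred into the ASR model. Everything here is an illustrative assumption rather than the paper's actual setup: the toy lexicon, the character-level grapheme targets, the model sizes, and the simplified teacher forcing without explicit BOS/EOS tokens.

```python
# Minimal sketch (assumptions, not the paper's implementation): build
# phoneme-to-grapheme (P2G) pairs from text plus a pronunciation dictionary,
# and pre-train a Transformer whose decoder would later seed the ASR decoder.
import torch
import torch.nn as nn

# Toy pronunciation lexicon (illustrative; the paper uses a rich dictionary).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def make_p2g_pair(sentence: str):
    """Turn one plain-text sentence into a (phonemes, graphemes) pair."""
    words = sentence.lower().split()
    phonemes = [p for w in words for p in LEXICON.get(w, [])]
    graphemes = list(" ".join(words))  # character-level grapheme targets
    return phonemes, graphemes

class P2GTransformer(nn.Module):
    """Encoder-decoder for P2G conversion. After pre-training, the decoder
    (and output projection) would be copied into the end-to-end ASR model."""
    def __init__(self, n_phonemes: int, n_graphemes: int, d_model: int = 256):
        super().__init__()
        self.src_emb = nn.Embedding(n_phonemes, d_model)
        self.tgt_emb = nn.Embedding(n_graphemes, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.proj = nn.Linear(d_model, n_graphemes)

    def forward(self, phoneme_ids, grapheme_ids):
        # Causal mask: the decoder may only attend to past grapheme tokens.
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            grapheme_ids.size(1))
        h = self.transformer(self.src_emb(phoneme_ids),
                             self.tgt_emb(grapheme_ids),
                             tgt_mask=tgt_mask)
        return self.proj(h)  # logits over the grapheme vocabulary

# One illustrative pre-training step on a single toy pair; real pre-training
# would batch millions of pairs derived from Web text.
phonemes, graphemes = make_p2g_pair("hello world")
phon_vocab = {p: i for i, p in enumerate(sorted(set(phonemes)))}
graph_vocab = {g: i for i, g in enumerate(sorted(set(graphemes)))}
model = P2GTransformer(len(phon_vocab), len(graph_vocab))

src = torch.tensor([[phon_vocab[p] for p in phonemes]])
tgt = torch.tensor([[graph_vocab[g] for g in graphemes]])
logits = model(src, tgt[:, :-1])                 # teacher forcing (shifted)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
loss.backward()                                  # one P2G pre-training step
```

After pre-training, the decoder parameters (`self.transformer.decoder`, `tgt_emb`, and `proj` in this sketch) would initialize the corresponding components of the speech encoder-decoder before fine-tuning on speech-to-text pairs; how much of the decoder to transfer and whether to freeze it are choices this sketch does not settle.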