Phoneme-to-Grapheme Conversion Based Large-Scale Pre-Training for End-to-End Automatic Speech Recognition

Cited by: 6
Authors
Masumura, Ryo [1]
Makishima, Naoki [1]
Ihori, Mana [1]
Takashima, Akihiko [1]
Tanaka, Tomohiro [1]
Orihashi, Shota [1]
Affiliations
[1] NTT Corp, NTT Media Intelligence Labs, Tokyo, Japan
Source
INTERSPEECH 2020 | 2020
Keywords
end-to-end automatic speech recognition; phoneme-to-grapheme conversion; pre-training; Transformer; data augmentation
DOI
10.21437/Interspeech.2020-1930
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
This paper describes a simple and efficient pre-training method that uses a large amount of external text to enhance end-to-end automatic speech recognition (ASR). Constructing end-to-end ASR models generally requires speech-to-text paired data, but collecting a large amount of such data is difficult in practice. One issue caused by this data scarcity is poor ASR performance on out-of-domain tasks, i.e., tasks whose domain differs from that of the available speech-to-text paired data, since the mapping from speech to text is not well learned for those domains. To address this problem, we leverage a large amount of phoneme-to-grapheme (P2G) paired data, which can easily be created from external texts and a rich pronunciation dictionary. P2G conversion and end-to-end ASR can be regarded as similar transformation tasks, in which input phonetic information is converted into textual information. Our method uses the P2G conversion task to pre-train the decoder network of a Transformer encoder-decoder based end-to-end ASR model. Experiments using 4 billion tokens of Web text demonstrate that our pre-training significantly improves ASR performance on out-of-domain tasks.
Pages: 2822-2826
Page count: 5
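To make the recipe in the abstract concrete, below is a minimal PyTorch sketch of the core idea: derive P2G pairs from plain text with a pronunciation dictionary, then train a Transformer encoder-decoder to map phonemes to graphemes so that its decoder can later be transferred into the ASR model. Everything here is an illustrative assumption rather than the paper's actual setup: the toy lexicon, the character-level grapheme targets, the model sizes, and the simplified teacher forcing without explicit BOS/EOS tokens.

```python
# Minimal sketch (assumptions, not the paper's implementation): build
# phoneme-to-grapheme (P2G) pairs from text plus a pronunciation dictionary,
# and pre-train a Transformer whose decoder would later seed the ASR decoder.
import torch
import torch.nn as nn

# Toy pronunciation lexicon (illustrative; the paper uses a rich dictionary).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def make_p2g_pair(sentence: str):
    """Turn one plain-text sentence into a (phonemes, graphemes) pair."""
    words = sentence.lower().split()
    phonemes = [p for w in words for p in LEXICON.get(w, [])]
    graphemes = list(" ".join(words))  # character-level grapheme targets
    return phonemes, graphemes

class P2GTransformer(nn.Module):
    """Encoder-decoder for P2G conversion. After pre-training, the decoder
    (and output projection) would be copied into the end-to-end ASR model."""
    def __init__(self, n_phonemes: int, n_graphemes: int, d_model: int = 256):
        super().__init__()
        self.src_emb = nn.Embedding(n_phonemes, d_model)
        self.tgt_emb = nn.Embedding(n_graphemes, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.proj = nn.Linear(d_model, n_graphemes)

    def forward(self, phoneme_ids, grapheme_ids):
        # Causal mask: the decoder may only attend to past grapheme tokens.
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            grapheme_ids.size(1))
        h = self.transformer(self.src_emb(phoneme_ids),
                             self.tgt_emb(grapheme_ids),
                             tgt_mask=tgt_mask)
        return self.proj(h)  # logits over the grapheme vocabulary

# One illustrative pre-training step on a single toy pair; real pre-training
# would batch millions of pairs derived from Web text.
phonemes, graphemes = make_p2g_pair("hello world")
phon_vocab = {p: i for i, p in enumerate(sorted(set(phonemes)))}
graph_vocab = {g: i for i, g in enumerate(sorted(set(graphemes)))}
model = P2GTransformer(len(phon_vocab), len(graph_vocab))

src = torch.tensor([[phon_vocab[p] for p in phonemes]])
tgt = torch.tensor([[graph_vocab[g] for g in graphemes]])
logits = model(src, tgt[:, :-1])                 # teacher forcing (shifted)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
loss.backward()                                  # one P2G pre-training step
```

After pre-training, the decoder parameters (`self.transformer.decoder`, `tgt_emb`, and `proj` in this sketch) would initialize the corresponding components of the speech encoder-decoder before fine-tuning on speech-to-text pairs; how much of the decoder to transfer and whether to freeze it are choices this sketch does not settle.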