Dialect-aware Semi-supervised Learning for End-to-End Multi-dialect Speech Recognition

Cited: 0
Authors
Shiota, Sayaka [1 ]
Imaizumi, Ryo [2 ]
Masumura, Ryo [1 ]
Kiya, Hitoshi [1 ]
Affiliations
[1] Tokyo Metropolitan Univ, Dept Comp Sci, Tokyo, Japan
[2] NTT Corp, NTT Comp & Data Sci Labs, Tokyo, Japan
Source
PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC) | 2022
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we propose dialect-aware semi-supervised learning for end-to-end automatic speech recognition (ASR) models that handle multi-dialect speech. Some multi-domain ASR tasks require large amounts of training data annotated with additional information (e.g., language or dialect), but preparing such data with accurate transcriptions is difficult. Semi-supervised learning is a method for effectively exploiting massive amounts of untranscribed data, and it can be applied to multi-domain ASR tasks to mitigate the problem of missing transcriptions. However, conventional semi-supervised learning typically uses only the generated pseudo-transcriptions. The problem is that naively combining a multi-domain model with semi-supervised learning makes no use of the additional information even when it is available. Therefore, in this paper, we focus on semi-supervised learning based on a multi-domain model that takes additional domain information into account. Since the accuracy of the pseudo-transcriptions can be improved by using the multi-domain model together with the additional information, the proposed semi-supervised learning is expected to yield a more reliable ASR model. In experiments, we performed Japanese multi-dialect ASR as one instance of multi-domain ASR. A model trained with the proposed method achieved the lowest character error rate among all models, including those trained with the conventional semi-supervised method.
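The core idea in the abstract can be sketched as a pseudo-labeling loop in which the multi-domain ASR model is conditioned on each utterance's dialect tag when generating pseudo-transcriptions. The following is a minimal hypothetical Python sketch, not the authors' code: `Utterance`, `pseudo_label`, the dialect names, and the toy decoder are all illustrative assumptions.

```python
# Hypothetical sketch of dialect-aware pseudo-labeling (not the authors' code).
# A dialect-aware multi-domain ASR model decodes untranscribed utterances,
# conditioning on each utterance's dialect tag; the resulting pseudo-labeled
# data would then be combined with transcribed data to retrain the model.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    audio_id: str
    dialect: str                 # assumed dialect label, e.g. "osaka"
    transcript: Optional[str]    # None for untranscribed data


def transcribe(model, utt: Utterance) -> str:
    """Stand-in for an E2E ASR decode conditioned on a dialect token.

    A real system might prepend a dialect token such as <osaka> to the
    decoder input; here the model is just a callable taking both inputs.
    """
    return model(utt.audio_id, utt.dialect)


def pseudo_label(model, unlabeled):
    """Attach pseudo-transcriptions produced by the dialect-aware model."""
    return [Utterance(u.audio_id, u.dialect, transcribe(model, u))
            for u in unlabeled]


# Toy deterministic "decoder" for illustration only.
toy_model = lambda audio_id, dialect: f"pseudo[{dialect}:{audio_id}]"

unlabeled = [Utterance("utt1", "osaka", None),
             Utterance("utt2", "tokyo", None)]
labeled = pseudo_label(toy_model, unlabeled)
# The combined transcribed + pseudo-labeled set would then be used to
# retrain the end-to-end model, as in standard pseudo-labeling.
```

The point of the sketch is only the conditioning: because the decode sees the dialect tag, the pseudo-transcriptions reflect dialect-specific characteristics, which the abstract argues improves their accuracy over dialect-agnostic pseudo-labeling.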
Pages: 240 - 244
Page count: 5