Dialect-aware Semi-supervised Learning for End-to-End Multi-dialect Speech Recognition

Cited: 0
Authors
Shiota, Sayaka [1 ]
Imaizumi, Ryo [2 ]
Masumura, Ryo [1 ]
Kiya, Hitoshi [1 ]
Affiliations
[1] Tokyo Metropolitan Univ, Dept Comp Sci, Tokyo, Japan
[2] NTT Corp, NTT Comp & Data Sci Labs, Tokyo, Japan
Source
PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC) | 2022
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we propose dialect-aware semi-supervised learning for end-to-end automatic speech recognition (ASR) models that handle multi-dialect speech. Some multi-domain ASR tasks require large amounts of training data annotated with additional information (e.g., language or dialect), but preparing such data with accurate transcriptions is difficult. Semi-supervised learning is a method for effectively exploiting massive amounts of untranscribed data, and it can be applied to multi-domain ASR tasks to mitigate the problem of missing transcriptions. However, conventional semi-supervised learning typically uses only the generated pseudo-transcriptions. The problem is that naively combining a multi-domain model with semi-supervised learning makes no use of the additional information even when it is available. Therefore, in this paper, we focus on semi-supervised learning based on a multi-domain model that takes additional domain information into account. Since the accuracy of the pseudo-transcriptions can be improved by using the multi-domain model together with the additional information, the proposed semi-supervised learning is expected to yield a more reliable ASR model. In experiments, we performed Japanese multi-dialect ASR as one instance of multi-domain ASR. A model trained with the proposed method achieved the lowest character error rate among all models, including those trained with the conventional semi-supervised method.
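The core idea in the abstract can be sketched as a pseudo-labeling loop in which the multi-domain ASR model is conditioned on each utterance's dialect tag when generating pseudo-transcriptions. The following is a minimal hypothetical Python sketch, not the authors' code: `Utterance`, `pseudo_label`, the dialect names, and the toy decoder are all illustrative assumptions.

```python
# Hypothetical sketch of dialect-aware pseudo-labeling (not the authors' code).
# A dialect-aware multi-domain ASR model decodes untranscribed utterances,
# conditioning on each utterance's dialect tag; the resulting pseudo-labeled
# data would then be combined with transcribed data to retrain the model.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    audio_id: str
    dialect: str                 # assumed dialect label, e.g. "osaka"
    transcript: Optional[str]    # None for untranscribed data


def transcribe(model, utt: Utterance) -> str:
    """Stand-in for an E2E ASR decode conditioned on a dialect token.

    A real system might prepend a dialect token such as <osaka> to the
    decoder input; here the model is just a callable taking both inputs.
    """
    return model(utt.audio_id, utt.dialect)


def pseudo_label(model, unlabeled):
    """Attach pseudo-transcriptions produced by the dialect-aware model."""
    return [Utterance(u.audio_id, u.dialect, transcribe(model, u))
            for u in unlabeled]


# Toy deterministic "decoder" for illustration only.
toy_model = lambda audio_id, dialect: f"pseudo[{dialect}:{audio_id}]"

unlabeled = [Utterance("utt1", "osaka", None),
             Utterance("utt2", "tokyo", None)]
labeled = pseudo_label(toy_model, unlabeled)
# The combined transcribed + pseudo-labeled set would then be used to
# retrain the end-to-end model, as in standard pseudo-labeling.
```

The point of the sketch is only the conditioning: because the decode sees the dialect tag, the pseudo-transcriptions reflect dialect-specific characteristics, which the abstract argues improves their accuracy over dialect-agnostic pseudo-labeling.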
Pages: 240 - 244
Page count: 5