Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting

Cited by: 0
Authors
Kashiwagi, Yosuke [1 ]
Futami, Hayato [1 ]
Tsunoo, Emiru [1 ]
Arora, Siddhant [2 ]
Watanabe, Shinji [2 ]
Affiliations
[1] Sony Grp Corp, Minato City, Tokyo, Japan
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Source
INTERSPEECH 2024, 2024
Keywords
speech recognition; E2E; multi-lingual; prompting; adaptation
DOI
10.21437/Interspeech.2024-702
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
End-to-end multilingual speech recognition models handle multiple languages through a single model, often incorporating language identification to automatically detect the language of incoming speech. Since, in the common scenario, the language is already known in advance, these models can be made to behave as language-specific models by using language information as a prompt, which is particularly beneficial for attention-based encoder-decoder architectures. However, the Connectionist Temporal Classification (CTC) approach, which enhances recognition via joint decoding and multi-task training, does not normally incorporate language prompts because its output tokens are conditionally independent. To overcome this, we introduce an encoder prompting technique within the self-conditioned CTC framework, enabling language-specific adaptation of the CTC model in a zero-shot manner. Our method has been shown to reduce errors significantly: by 28% on average and by 41% on low-resource languages.
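The idea the abstract describes, letting intermediate CTC predictions condition later encoder layers (self-conditioned CTC) while biasing those predictions toward a known language, can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: it assumes the language prompt acts by masking the intermediate CTC vocabulary to the known language's token set before the posteriors are fed back, and all names and shapes (`w_ctc`, `w_back`, `lang_mask`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_conditioned_layer(h, w_ctc, w_back, lang_mask=None):
    """One intermediate step of self-conditioned CTC (illustrative).

    h:         (T, d) encoder features for T frames
    w_ctc:     (d, V) intermediate CTC head over vocabulary V
    w_back:    (V, d) projects posteriors back into the encoder stream
    lang_mask: (V,) boolean mask of tokens valid for the known language;
               acting as a zero-shot "encoder prompt" that suppresses
               other languages' tokens before feedback
    """
    logits = h @ w_ctc
    if lang_mask is not None:
        # Bias intermediate predictions toward the prompted language
        logits = np.where(lang_mask, logits, -1e9)
    post = softmax(logits)
    # Next encoder layer is conditioned on the (biased) posteriors
    return h + post @ w_back
```

Because the bias is applied only to the intermediate posteriors at inference time, no retraining is needed, which mirrors the zero-shot adaptation claim in the abstract.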
Pages: 2900-2904
Page count: 5