Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting

Cited by: 0
Authors
Kashiwagi, Yosuke [1]
Futami, Hayato [1]
Tsunoo, Emiru [1]
Arora, Siddhant [2]
Watanabe, Shinji [2]
Affiliations
[1] Sony Grp Corp, Minato City, Tokyo, Japan
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Source
INTERSPEECH 2024 | 2024
Keywords
speech recognition; E2E; multi-lingual; prompting; adaptation;
DOI
10.21437/Interspeech.2024-702
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
End-to-end multilingual speech recognition models handle multiple languages with a single model, often incorporating language identification to automatically detect the language of incoming speech. Because the language is already known in the common scenario, these models can act as language-specific recognizers when given language information as a prompt, which is particularly beneficial for attention-based encoder-decoder architectures. However, the Connectionist Temporal Classification (CTC) approach, which enhances recognition via joint decoding and multi-task training, does not normally incorporate language prompts because its output tokens are conditionally independent. To overcome this, we introduce an encoder prompting technique within the self-conditioned CTC framework, enabling language-specific adaptation of the CTC model in a zero-shot manner. Our method has been shown to significantly reduce errors, by 28% on average and by 41% on low-resource languages.
Pages: 2900-2904
Page count: 5