Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting

Cited by: 0
Authors
Kashiwagi, Yosuke [1 ]
Futami, Hayato [1 ]
Tsunoo, Emiru [1 ]
Arora, Siddhant [2 ]
Watanabe, Shinji [2 ]
Affiliations
[1] Sony Grp Corp, Minato City, Tokyo, Japan
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Source
INTERSPEECH 2024, 2024
Keywords
speech recognition; E2E; multi-lingual; prompting; adaptation
DOI
10.21437/Interspeech.2024-702
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
End-to-end multilingual speech recognition models handle multiple languages through a single model, often incorporating language identification to automatically detect the language of incoming speech. Since, in the common scenario, the language is already known in advance, these models can be made to behave as language-specific models by using language information as a prompt, which is particularly beneficial for attention-based encoder-decoder architectures. However, the Connectionist Temporal Classification (CTC) approach, which enhances recognition via joint decoding and multi-task training, does not normally incorporate language prompts because its output tokens are conditionally independent. To overcome this, we introduce an encoder prompting technique within the self-conditioned CTC framework, enabling language-specific adaptation of the CTC model in a zero-shot manner. Our method has been shown to reduce errors significantly: by 28% on average and by 41% on low-resource languages.
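The idea the abstract describes, letting intermediate CTC predictions condition later encoder layers (self-conditioned CTC) while biasing those predictions toward a known language, can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: it assumes the language prompt acts by masking the intermediate CTC vocabulary to the known language's token set before the posteriors are fed back, and all names and shapes (`w_ctc`, `w_back`, `lang_mask`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_conditioned_layer(h, w_ctc, w_back, lang_mask=None):
    """One intermediate step of self-conditioned CTC (illustrative).

    h:         (T, d) encoder features for T frames
    w_ctc:     (d, V) intermediate CTC head over vocabulary V
    w_back:    (V, d) projects posteriors back into the encoder stream
    lang_mask: (V,) boolean mask of tokens valid for the known language;
               acting as a zero-shot "encoder prompt" that suppresses
               other languages' tokens before feedback
    """
    logits = h @ w_ctc
    if lang_mask is not None:
        # Bias intermediate predictions toward the prompted language
        logits = np.where(lang_mask, logits, -1e9)
    post = softmax(logits)
    # Next encoder layer is conditioned on the (biased) posteriors
    return h + post @ w_back
```

Because the bias is applied only to the intermediate posteriors at inference time, no retraining is needed, which mirrors the zero-shot adaptation claim in the abstract.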
Pages: 2900-2904
Page count: 5