Text-only Domain Adaptation using Unified Speech-Text Representation in Transducer

Cited by: 1
Authors
Huang, Lu [1]
Li, Boyu [1]
Zhang, Jun [1]
Lu, Lu [1]
Ma, Zejun [1]
Affiliations
[1] ByteDance, Beijing, People's Republic of China
Source
INTERSPEECH 2023 | 2023
Keywords
automatic speech recognition; text-only; domain adaptation; conformer transducer
DOI
10.21437/Interspeech.2023-1313
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Domain adaptation using a text-only corpus is challenging in end-to-end (E2E) speech recognition. Adaptation by synthesizing audio from text with text-to-speech (TTS) is resource-consuming. We present a method to learn a Unified Speech-Text Representation in Conformer Transducer (USTR-CT) that enables fast domain adaptation using a text-only corpus. Unlike the previous textogram method, our work introduces an extra text encoder to learn the text representation; the encoder is removed during inference, so online deployment requires no modification. To improve the efficiency of adaptation, single-step and multi-step adaptations are also explored. Experiments on adapting LibriSpeech to SPGISpeech show that the proposed method reduces the word error rate (WER) on the target domain by a relative 44%, outperforming both the TTS method and the textogram method. The proposed method can also be combined with internal language model estimation (ILME) to further improve performance.
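The "relative" WER reduction quoted in the abstract is a ratio against the baseline error rate, not a difference in absolute percentage points. The sketch below illustrates the arithmetic only. The function names `word_error_rate` and `relative_wer_reduction` are hypothetical, and the numeric values are invented for illustration; they are not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction of the baseline WER removed by adaptation."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# Illustrative values only (assumed, not taken from the paper):
# a baseline WER of 20.0% and an adapted WER of 11.2%.
reduction = relative_wer_reduction(20.0, 11.2)
print(f"relative WER reduction: {reduction:.0%}")
```

Under these assumed values the absolute improvement is 8.8 percentage points, while the relative reduction is 8.8 / 20.0, which is the kind of figure the abstract reports.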
Pages: 386-390
Page count: 5