Unsupervised Code-switched Text Generation from Parallel Text

Cited by: 3
Authors
Chi, Jie [1]
Lu, Brian [2]
Eisner, Jason [2]
Bell, Peter [1]
Jyothi, Preethi [3]
Ali, Ahmed M. [4]
Affiliations
[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland
[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD USA
[3] Indian Inst Technol, Dept Comp Sci, Bombay, Maharashtra, India
[4] HBKU, Qatar Comp Res Inst, Ar Rayyan, Qatar
Source
INTERSPEECH 2023 | 2023
Keywords
code-switching; text generation; data augmentation; encoder-decoder; unsupervised learning;
DOI
10.21437/Interspeech.2023-1050
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
There has been great interest in developing automatic speech recognition (ASR) systems that can handle code-switched (CS) speech to meet the needs of a growing bilingual population. However, existing datasets are limited in size. It is expensive and difficult to collect real transcribed spoken CS data due to the challenges of finding and identifying CS data in the wild. As a result, many attempts have been made to generate synthetic CS data. Existing methods either require the existence of CS data during training, or are driven by linguistic knowledge. We introduce a novel approach of forcing a multilingual MT system that was trained on non-CS data to generate CS translations. Comparing against two prior methods, we show that simply leveraging the shared representations of two languages (Mandarin and English) yields better CS text generation and, ultimately, better CS ASR.
Pages: 1419-1423
Page count: 5