Unsupervised Code-switched Text Generation from Parallel Text

Cited by: 3
Authors
Chi, Jie [1]
Lu, Brian [2]
Eisner, Jason [2]
Bell, Peter [1]
Jyothi, Preethi [3]
Ali, Ahmed M. [4]
Affiliations
[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland
[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD USA
[3] Indian Inst Technol, Dept Comp Sci, Bombay, Maharashtra, India
[4] HBKU, Qatar Comp Res Inst, Ar Rayyan, Qatar
Source
INTERSPEECH 2023 | 2023
Keywords
code-switching; text generation; data augmentation; encoder-decoder; unsupervised learning
DOI
10.21437/Interspeech.2023-1050
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
There has been great interest in developing automatic speech recognition (ASR) systems that can handle code-switched (CS) speech to meet the needs of a growing bilingual population. However, existing datasets are limited in size. It is expensive and difficult to collect real transcribed spoken CS data due to the challenges of finding and identifying CS data in the wild. As a result, many attempts have been made to generate synthetic CS data. Existing methods either require the existence of CS data during training, or are driven by linguistic knowledge. We introduce a novel approach of forcing a multilingual MT system that was trained on non-CS data to generate CS translations. Comparing against two prior methods, we show that simply leveraging the shared representations of two languages (Mandarin and English) yields better CS text generation and, ultimately, better CS ASR.
Pages: 1419-1423
Number of pages: 5