Style Variation as a Vantage Point for Code-Switching

Cited by: 2
Authors
Chandu, Khyathi Raghavi [1]
Black, Alan W. [1]
Affiliations
[1] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Source
INTERSPEECH 2020 | 2020
Keywords
code-switching; style transfer; non-parallel data; adversarial training; SENTENCE;
DOI
10.21437/Interspeech.2020-2574
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification code
100104; 100213;
Abstract
Code-Switching (CS) is a prevalent phenomenon observed in bilingual and multilingual communities, especially on digital and social media platforms. A major problem in this domain is the dearth of substantial corpora to train large-scale neural models. Generating vast amounts of quality synthetic text assists several downstream tasks that rely heavily on language modeling, such as speech recognition and text-to-speech synthesis. We present a novel vantage point of CS as style variations between the two participating languages. Our approach does not need any external dense annotations such as lexical language IDs. It relies on easily obtainable monolingual corpora without any parallel alignment and a limited set of naturally occurring CS sentences. We propose a two-stage generative adversarial training approach in which the first stage generates competitive negative examples for CS and the second stage generates more realistic CS sentences. We present our experiments on the following language pairs: Spanish-English, Mandarin-English, Hindi-English and Arabic-French. We show that the trends in metrics for generated CS move closer to those of real CS data in the above language pairs through the dual-stage training process. We believe this viewpoint of CS as style variations opens new perspectives for modeling various tasks in CS text.
Pages: 4761-4765
Page count: 5