Synthesising isiZulu-English code-switch bigrams using word embeddings

Cited by: 11
Authors
van der Westhuizen, Ewald [1]
Niesler, Thomas [1]
Affiliations
[1] Stellenbosch Univ, Dept Elect & Elect Engn, Stellenbosch, South Africa
Source
18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017), Vols 1-6: Situated Interaction | 2017
Keywords
code-switching; word vectors; word embeddings; Zulu; isiZulu; spontaneous
DOI
10.21437/Interspeech.2017-1437
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Code-switching is prevalent among South African speakers, and presents a challenge to automatic speech recognition systems. It is predominantly a spoken phenomenon, and generally does not occur in textual form. Therefore, a particularly serious challenge is the extreme lack of training material for language modelling. We investigate the use of word embeddings to synthesise isiZulu-to-English code-switch bigrams with which to augment such sparse language model training data. A variety of word embeddings are trained on a monolingual English web text corpus, and subsequently queried to synthesise code-switch bigrams. Our evaluation is performed on language models trained on a new, although small, English-isiZulu code-switch corpus compiled from South African soap operas. This data is characterised by fast, spontaneously spoken speech containing frequent code-switching. We show that the augmentation of the training data with code-switched bigrams synthesised in this way leads to a reduction in perplexity.
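
The abstract does not spell out how the embeddings are queried, so the following is only a minimal sketch of one plausible reading: for each observed code-switch bigram, the English word is replaced by its nearest neighbours in an English embedding space, and the resulting unseen pairs are kept as synthetic bigrams. The function names, the toy vectors, and the placeholder tokens `ZU_A` and `ZU_B` (standing in for isiZulu words) are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(word, embeddings, k):
    """Return the k words most similar to `word`, excluding `word` itself."""
    target = embeddings[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]


def synthesise_bigrams(seed_bigrams, embeddings, k=2):
    """Propose unseen (isiZulu, English) bigrams from observed ones.

    Assumption: the English word of each seed bigram is swapped for its
    nearest embedding neighbours. The paper may use a different query.
    """
    seen = set(seed_bigrams)
    synthetic = []
    for left, right in seed_bigrams:
        if right not in embeddings:
            continue  # no vector available for this English word
        for neighbour in nearest_neighbours(right, embeddings, k):
            candidate = (left, neighbour)
            if candidate not in seen:
                seen.add(candidate)
                synthetic.append(candidate)
    return synthetic


# Toy English embeddings (hypothetical values, for illustration only).
TOY_EMBEDDINGS = {
    "phone": [0.9, 0.1, 0.0],
    "cellphone": [0.85, 0.15, 0.05],
    "car": [0.1, 0.9, 0.0],
    "truck": [0.15, 0.85, 0.1],
    "happy": [0.0, 0.1, 0.9],
}

# Placeholder seed bigrams: ZU_A and ZU_B stand in for isiZulu words.
SEED_BIGRAMS = [("ZU_A", "phone"), ("ZU_B", "car")]

if __name__ == "__main__":
    for bigram in synthesise_bigrams(SEED_BIGRAMS, TOY_EMBEDDINGS, k=1):
        print(bigram)
```

In a setup like the one the abstract describes, the synthetic bigrams would be added to the sparse code-switch training text before the language model is estimated, and perplexity on held-out code-switched data would be compared with and without the augmentation.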
Pages: 72-76
Number of pages: 5