The French-Algerian Code-Switching Triggered audio corpus (FACST)

Cited by: 0
Authors
Amazouz, Djegdjiga [1]
Adda-Decker, Martine [1, 2]
Lamel, Lori [2]
Affiliations
[1] Univ Sorbonne Nouvelle Paris III, LPP, CNRS, Paris, France
[2] Paris Saclay Univ, CNRS, LIMSI, Orsay, France
Source
PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018) | 2018
Keywords
Code-switching; bilingual speakers; oral speech data; French; Arabic;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
The French-Algerian Code-Switching Triggered corpus (FACST) was created to support a variety of studies in phonetics, prosody and natural language processing. The first aim of the FACST corpus is to collect a corpus of spontaneous code-switching (CS) speech. In order to obtain a large quantity of spontaneous CS utterances in natural conversations, experiments were carried out on how to elicit CS. Applying a triggering protocol by means of code-switched questions was found to be effective in eliciting CS in the responses. To ensure good audio quality, all recordings were made in a soundproof room or in a very calm room. This paper describes the FACST corpus, along with the principal steps in building a French-Algerian CS speech corpus and in collecting the data. We also explain the selection criteria for the CS speakers and the recording protocols used. We present the methods used for data segmentation and annotation, and propose a transcription convention for this type of speech in each language, intended to suit both computational linguistic and acoustic-phonetic studies. We provide a quantitative description of the FACST corpus along with results of linguistic studies, and discuss some of the challenges we faced in collecting CS data.
Pages: 1468-1473
Page count: 6