DZDC12: a new multipurpose parallel Algerian Arabizi-French code-switched corpus

Cited by: 6
Author
Abainia, Kheireddine [1]
Affiliation
[1] Univ 8 Mai 1945 Guelma, Fac Sci & Technol, Telecommunicat & Elect Dept, Guelma 24000, Algeria
Keywords
Parallel corpus; Arabizi; Algerian dialects; Code-switching; Text categorization; Machine translation; DIALECT; RECOGNITION;
DOI
10.1007/s10579-019-09454-8
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Algeria's socio-linguistic situation is known as a complex phenomenon involving several historical, cultural and technological factors. However, there are three languages that are mainly spoken in Algeria (Arabic, Tamazight and French) and they can be mixed in the same sentence (code-switching). Moreover, there are several varieties of dialects that differ from one region to another and sometimes within the same region. This paper aims to provide a new multi-purpose parallel corpus (i.e., DZDC12 corpus), which will serve as a testbed for various natural language processing and information retrieval applications. In particular, it can be a useful tool to study Arabic-French code-switching phenomenon, Algerian Romanized Arabic (Arabizi), different Algerian sub-dialects, sentiment analysis, gender writing style, machine translation, abuse detection, etc. To the best of our knowledge, the proposed corpus is the first of its kind, where the texts are written in Latin script and crawled from Facebook. More specifically, this corpus is organised by gender, region and city, and is transliterated into Arabic script and translated into Modern Standard Arabic. In addition, it is annotated for emotion detection and abuse detection, and annotated at the word level. This article focuses in particular on Algeria's socio-linguistic situation and the effect of social media networks. Furthermore, the general guidelines for the design of DZDC12 corpus are described as well as the dialects clustering over the map.
Pages: 419-455
Number of pages: 37
Related papers
3 in total
  • [1] DZDC12: a new multipurpose parallel Algerian Arabizi–French code-switched corpus
    Kheireddine Abainia
    Language Resources and Evaluation, 2020, 54 : 419 - 455
  • [2] An Algerian Arabic-French Code-Switched Corpus
    Cotterell, Ryan
    Renduchintala, Adithya
    Saphra, Naomi
    Callison-Burch, Chris
    LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014
  • [3] Machine Translation on a Parallel Code-Switched Corpus
    Menacer, M. A.
    Langlois, D.
    Jouvet, D.
    Fohr, D.
    Mella, O.
    Smaili, K.
    ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, 11489 : 426 - 432