DART: A Large Dataset of Dialectal Arabic Tweets

Citations: 0
Authors
Alsarsour, Israa [1]
Mohamed, Esraa [1]
Suwaileh, Reem [1]
Elsayed, Tamer [1]
Affiliations
[1] Qatar Univ, Comp Sci & Engn Dept, Doha, Qatar
Source
PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018) | 2018
Keywords
Arabic; Multi-Dialect; Twitter; Crowdsourcing; Annotations; Corpus;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
In this paper, we present a new, large, manually annotated multi-dialect dataset of Arabic tweets that is publicly available. The Dialectal ARabic Tweets (DART) dataset contains about 25K tweets annotated via crowdsourcing and is well balanced across five main groups of Arabic dialects: Egyptian, Maghrebi, Levantine, Gulf, and Iraqi. The paper outlines the pipeline for constructing the dataset, from crawling tweets that match a list of dialect phrases to having the crowd annotate them. We also touch on some challenges that we faced during the process. We evaluate the quality of the dataset from two perspectives: the inter-annotator agreement and the accuracy of the final labels. Results show that both measures were substantially high for the Egyptian, Gulf, and Levantine dialect groups, but lower for the Iraqi and Maghrebi dialects, which indicates the difficulty of identifying those two dialects manually and hence automatically.
Pages: 3666-3670
Page count: 5