Building a Tunisian Dialect into Arabic Language Parallel Corpus for a Phrase-based Machine Translation

Cited by: 0
Authors
Sghaier, Mohamed Ali [1 ,2 ]
Zrigui, Mounir [1 ]
Affiliations
[1] Univ Monastir, Fac Sci Monastir, Dept Comp Sci, Algebra Numbers Theory & Nonlinear Anal Lab LATNA, Monastir, Tunisia
[2] Univ Sousse, Hammam Sousse, Higher Inst Comp Sci & Commun Tech, Sousse, Tunisia
Source
VISION 2025: EDUCATION EXCELLENCE AND MANAGEMENT OF INNOVATIONS THROUGH SUSTAINABLE ECONOMIC COMPETITIVE ADVANTAGE | 2019
Keywords
Natural Language Processing; Machine Translation; Statistical Approach; Tunisian Dialect; Modern Standard Arabic
DOI
Not available
Chinese Library Classification number
F [Economics]
Discipline classification code
02
Abstract
The purpose of this paper is to build a system capable of translating the dialect of Tunisia's capital into Modern Standard Arabic. Such a tool can have an impact in various domains, such as translating social network user interactions, subtitles of Tunisian movies, and books written in their authors' local dialect. Since the Tunisian dialect, like the other Arabic dialects, is classified as low-resource, we started by building a parallel corpus of 5000 sentences. These sentences were extracted from different sources accessible on the web, in particular Facebook and YouTube, and were then manually translated into Arabic. Afterwards, this resource was used to apply the statistical approach, which is based on the creation of a Language Model (LM) and a Translation Model (TM). These two models are then used by the decoder to choose the best translation for an input sentence. The results were promising: we achieved a BLEU score of 49.90.
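The way a decoder combines the two models can be illustrated with a minimal sketch. It assumes a noisy-channel style score, log P(source | candidate) from the TM plus log P(candidate) from a bigram LM, and it picks the highest-scoring candidate. The tokens (`s1`, `t1`, ...), the probability values, and the function names `lm_score` and `decode` are all hypothetical and invented for illustration. This record does not state which toolkit, features, or search procedure the authors used, and real phrase-based decoders search over phrase segmentations instead of scoring a fixed candidate list.

```python
import math

# Toy translation model: log P(source | candidate target).
# Every entry is invented for illustration, not taken from the paper.
TM_LOGPROB = {
    ("s1 s2", "t1 t2"): math.log(0.6),
    ("s1 s2", "t2 t1"): math.log(0.6),
    ("s1 s2", "t1 t3"): math.log(0.3),
}

# Toy bigram language model over the target language:
# log P(word | previous word). Values are also invented.
LM_LOGPROB = {
    ("<s>", "t1"): math.log(0.5),
    ("<s>", "t2"): math.log(0.1),
    ("t1", "t2"): math.log(0.4),
    ("t2", "t1"): math.log(0.1),
    ("t1", "t3"): math.log(0.2),
    ("t2", "</s>"): math.log(0.5),
    ("t1", "</s>"): math.log(0.3),
    ("t3", "</s>"): math.log(0.5),
}

# Fallback log-probability for bigrams the toy LM has never seen.
UNSEEN = math.log(1e-6)


def lm_score(sentence):
    """Sum bigram log-probabilities over the sentence, with boundary markers."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(LM_LOGPROB.get(pair, UNSEEN) for pair in zip(tokens, tokens[1:]))


def decode(source):
    """Return the candidate maximising TM score + LM score, or None if no candidate exists."""
    candidates = [tgt for (src, tgt) in TM_LOGPROB if src == source]
    if not candidates:
        return None
    return max(candidates, key=lambda tgt: TM_LOGPROB[(source, tgt)] + lm_score(tgt))


best = decode("s1 s2")
print(best)
```

In this toy setup the TM alone cannot separate `t1 t2` from `t2 t1`, since both have the same translation probability; the LM breaks the tie by preferring the more fluent word order.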
Pages: 2910-2921
Page count: 12
Related papers
25 in total
  • [1] [Anonymous], 2012, P COLING 2012 POST
  • [2] [Anonymous], 2013, P 51 ANN M ASS COMP
  • [3] Bacha K., 2012, KEOD, P347
  • [4] Banerjee S., 2005, P ACL WORKSH INTR EX, P65
  • [5] Ben Mohamed MA, 2015, INT ARAB J INF TECHN, V12, P566
  • [6] Bertoldi Nicola, 2009, PRAGUE B MATH LINGUI, P1
  • [7] Bouamor H, 2018, PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), P3387
  • [8] Bouamor H, 2014, LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, P1240
  • [9] Doddington G., 2002, P 2 INT C HUMAN LANG, P138
  • [10] Dyer Chris, 2013, P HUM LANG TECHN C N, P644