Supporting Undotted Arabic with Pre-trained Language Models

Cited by: 0
Authors
Rom, Aviad [1 ]
Bar, Kfir [1 ]
Affiliations
[1] Reichman Univ, Data Sci Inst, Herzliyya, Israel
Source
PROCEEDINGS OF THE 4TH INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE AND SPEECH PROCESSING, ICNLSP 2021 | 2021
DOI
Not available
CLC number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We observe a recent behaviour on social media in which users intentionally remove the consonantal dots from Arabic letters in order to bypass content-classification algorithms. Content classification is typically done by fine-tuning pre-trained language models, which have recently been employed in many natural-language-processing applications. In this work we study the effect of applying pre-trained Arabic language models to "undotted" Arabic texts. We suggest several ways of supporting undotted texts with pre-trained models, without additional training, and measure their performance on two Arabic natural-language-processing downstream tasks. The results are encouraging; on one of the tasks our method shows nearly perfect performance.
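The transformation at the heart of the abstract, removing consonantal dots so that several distinct letters collapse onto one skeleton shape, can be illustrated with a short Python sketch. The mapping and the undot function below are illustrative assumptions based on common dotless Unicode counterparts, not the exact table or method used in the paper.

# A minimal, hedged sketch of the "undotting" transformation described
# in the abstract: each dotted Arabic consonant is replaced by an
# assumed dotless skeleton counterpart. Not the paper's exact table.
UNDOT_TABLE = str.maketrans({
    "ب": "ٮ", "ت": "ٮ", "ث": "ٮ", "ن": "ٮ",  # beh/teh/theh/noon -> dotless beh (U+066E)
    "ج": "ح", "خ": "ح",                      # jeem/khah   -> hah
    "ذ": "د",                                # thal        -> dal
    "ز": "ر",                                # zain        -> reh
    "ش": "س",                                # sheen       -> seen
    "ض": "ص",                                # dad         -> sad
    "ظ": "ط",                                # zah         -> tah
    "غ": "ع",                                # ghain       -> ain
    "ف": "ڡ",                                # feh         -> dotless feh (U+06A1)
    "ق": "ٯ",                                # qaf         -> dotless qaf (U+066F)
    "ة": "ه",                                # teh marbuta -> heh
    "ي": "ى",                                # yeh         -> alef maksura
})

def undot(text: str) -> str:
    """Strip consonantal dots, collapsing letters onto shared skeletons."""
    return text.translate(UNDOT_TABLE)

# Example: every dotted letter in the input loses its dots.
print(undot("تجربة جديدة"))  # -> "ٮحرٮه حدىده"

Because many letters share one skeleton, the mapping is many-to-one and not invertible without context, which is why undotted text poses a challenge for models trained on standard dotted Arabic.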
Pages: 89-94
Number of pages: 6