Supporting Undotted Arabic with Pre-trained Language Models

Cited by: 0
Authors
Rom, Aviad [1 ]
Bar, Kfir [1 ]
Affiliations
[1] Reichman Univ, Data Sci Inst, Herzliyya, Israel
Source
PROCEEDINGS OF THE 4TH INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE AND SPEECH PROCESSING, ICNLSP 2021 | 2021
DOI
Not available
CLC number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We observe a recent behaviour on social media in which users intentionally remove the consonantal dots from Arabic letters in order to bypass content-classification algorithms. Content classification is typically done by fine-tuning pre-trained language models, which have recently been employed in many natural-language-processing applications. In this work we study the effect of applying pre-trained Arabic language models to "undotted" Arabic texts. We suggest several ways of supporting undotted texts with pre-trained models, without additional training, and measure their performance on two Arabic natural-language-processing downstream tasks. The results are encouraging; on one of the tasks our method shows nearly perfect performance.
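The transformation at the heart of the abstract, removing consonantal dots so that several distinct letters collapse onto one skeleton shape, can be illustrated with a short Python sketch. The mapping and the undot function below are illustrative assumptions based on common dotless Unicode counterparts, not the exact table or method used in the paper.

# A minimal, hedged sketch of the "undotting" transformation described
# in the abstract: each dotted Arabic consonant is replaced by an
# assumed dotless skeleton counterpart. Not the paper's exact table.
UNDOT_TABLE = str.maketrans({
    "ب": "ٮ", "ت": "ٮ", "ث": "ٮ", "ن": "ٮ",  # beh/teh/theh/noon -> dotless beh (U+066E)
    "ج": "ح", "خ": "ح",                      # jeem/khah   -> hah
    "ذ": "د",                                # thal        -> dal
    "ز": "ر",                                # zain        -> reh
    "ش": "س",                                # sheen       -> seen
    "ض": "ص",                                # dad         -> sad
    "ظ": "ط",                                # zah         -> tah
    "غ": "ع",                                # ghain       -> ain
    "ف": "ڡ",                                # feh         -> dotless feh (U+06A1)
    "ق": "ٯ",                                # qaf         -> dotless qaf (U+066F)
    "ة": "ه",                                # teh marbuta -> heh
    "ي": "ى",                                # yeh         -> alef maksura
})

def undot(text: str) -> str:
    """Strip consonantal dots, collapsing letters onto shared skeletons."""
    return text.translate(UNDOT_TABLE)

# Example: every dotted letter in the input loses its dots.
print(undot("تجربة جديدة"))  # -> "ٮحرٮه حدىده"

Because many letters share one skeleton, the mapping is many-to-one and not invertible without context, which is why undotted text poses a challenge for models trained on standard dotted Arabic.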
Pages: 89-94
Number of pages: 6