Evaluating Various Tokenizers for Arabic Text Classification

Cited by: 14

Authors:

Alyafeai, Zaid [1]

Al-shaibani, Maged S. [1]

Ghaleb, Mustafa [2]

Ahmad, Irfan [1,3]

Affiliations:

[1] King Fahd Univ Petr & Minerals, Dept Comp Sci, Dhahran 31261, Saudi Arabia

[2] King Fahd Univ Petr & Minerals, Interdisciplinary Res Ctr Intelligent Secure Syst, Dhahran 31261, Saudi Arabia

[3] King Fahd Univ Petr & Minerals, SDAIA KFUPM Joint Res Ctr Artificial Intelligence, Dhahran 31261, Saudi Arabia

Source:

NEURAL PROCESSING LETTERS | 2023 / Volume 55 / Issue 03

Keywords:

Text Tokenization; Arabic NLP; Text Classification; Sentiment Analysis; Poem-meter Classification;

DOI:

10.1007/s11063-022-10990-8

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The first step in any NLP pipeline is to split the text into individual tokens. The most obvious and straightforward approach is to use words as tokens. However, given a large text corpus, representing all the words is not efficient in terms of vocabulary size. In the literature, many tokenization algorithms have emerged to tackle this problem by creating subwords, which in turn limits the vocabulary size for a given text corpus. Most tokenization techniques are language-agnostic, i.e., they do not incorporate the linguistic features of a given language, and evaluating such techniques in practice is itself difficult. In this paper, we introduce three new tokenization algorithms for Arabic and compare them to three other popular tokenizers using unsupervised evaluations. In addition, we compare all six tokenizers by evaluating them on three supervised classification tasks: sentiment analysis, news classification, and poem-meter classification, using six publicly available datasets. Our experiments show that no single tokenization technique is the best choice in every setting and that the performance of a given tokenization algorithm depends on many factors, including the size of the dataset, the nature of the task, and the morphological richness of the dataset. However, some tokenization techniques perform better overall than others across the various text classification tasks.
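
The vocabulary-size argument in the abstract can be illustrated with a small sketch. This is not one of the six tokenizers evaluated in the paper: the greedy longest-match segmenter, the toy English corpus, and the hand-picked `SUBWORDS` inventory are all assumptions chosen only to show how subwords can shrink the number of distinct tokens.

```python
def word_tokenize(text):
    """Word-level tokenization: every whitespace-separated word is a token."""
    return text.split()


def subword_tokenize(text, subwords):
    """Greedy longest-match segmentation over a fixed subword inventory.

    Any span that matches no known subword falls back to a single character,
    so the loop always advances.
    """
    tokens = []
    for word in text.split():
        i = 0
        while i < len(word):
            # Try the longest candidate piece first, shrinking to one character.
            for j in range(len(word), i, -1):
                piece = word[i:j]
                if piece in subwords or j == i + 1:
                    tokens.append(piece)
                    i = j
                    break
    return tokens


# Hypothetical toy data, not taken from the paper's datasets.
CORPUS = "low lower lowest new newer newest"
SUBWORDS = {"low", "new", "er", "est"}

word_vocab = set(word_tokenize(CORPUS))
subword_vocab = set(subword_tokenize(CORPUS, SUBWORDS))

print(len(word_vocab), sorted(word_vocab))
print(len(subword_vocab), sorted(subword_vocab))
```

On this toy corpus the six distinct words are covered by four distinct subwords. Real subword tokenizers learn their inventory from data rather than taking it as a fixed set.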


Pages: 2911-2933

Page count: 23

