A Study of the Effects of Stemming Strategies on Arabic Document Classification

Times cited: 39

Authors:

Alhaj, Yousif A. [1]

Xiang, Jianwen [1]

Zhao, Dongdong [1]

Al-Qaness, Mohammed A. A. [2]

Abd Elaziz, Mohamed [3]

Dahou, Abdelghani [1]

Affiliations:

[1] Wuhan Univ Technol, Sch Comp Sci & Technol, Hubei Key Lab Transportat Internet Things, Wuhan 430070, Hubei, Peoples R China

[2] Wuhan Univ, Sch Comp Sci, Wuhan 430072, Hubei, Peoples R China

[3] Zagazig Univ, Fac Sci, Dept Math, Zagazig 44519, Egypt

Source:

IEEE ACCESS | 2019, Vol. 7

Keywords:

Arabic text classification; text preprocessing; stemming techniques; feature extraction; feature selection; TEXT CLASSIFICATION;

DOI:

10.1109/ACCESS.2019.2903331

Chinese Library Classification number:

TP [Automation Technology, Computer Technology];

Subject classification code:

0812 ;

Abstract:

Stemming is one of the most effective preprocessing techniques and has been adopted in many applications, such as machine learning, machine translation, document classification (DC), information retrieval, and natural language processing. In document classification, stemming is applied to reduce the high dimensionality of the feature space, which in turn improves the performance of the classification system, particularly for highly inflected languages such as Arabic. This paper studies the impact of three stemming techniques, namely Information Science Research Institute (ISRI), Tashaphyne, and ARLStem, on Arabic DC. Three classification algorithms are used: Naive Bayes (NB), support vector machine (SVM), and K-nearest neighbors (KNN). In addition, chi-square feature selection is used to select the most relevant features. Experiments are conducted on the CNN Arabic corpus, which is collected from Arabic websites, to assess the performance of the classification system. The classifiers are evaluated using K-fold cross-validation and Micro-F1. The findings indicate that ARLStem outperforms the ISRI and Tashaphyne stemmers. The results also show that SVM is more effective than the KNN and NB classifiers, achieving a Micro-F1 of 94.64% with the ARLStem stemmer.
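As a rough illustration of two quantities named in the abstract, the sketch below computes a chi-square score for a (term, class) pair from a 2x2 contingency table and a Micro-F1 score over pooled per-class counts. It is not the authors' implementation: the function names `chi_square` and `micro_f1`, the toy English tokens, and the class labels are all hypothetical, and the paper's own stemmers, corpus, and classifiers are not reproduced here.

```python
def chi_square(docs, labels, term, cls):
    """Chi-square statistic for one (term, class) pair.

    docs   : list of token sets, one per document (hypothetical toy input)
    labels : list of class labels, aligned with docs
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cls = label == cls
        if has_term and in_cls:
            a += 1          # term present, document in class
        elif has_term:
            b += 1          # term present, document in another class
        elif in_cls:
            c += 1          # term absent, document in class
        else:
            d += 1          # term absent, document in another class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for cls in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == cls and t == cls:
                tp += 1
            elif p == cls:
                fp += 1
            elif t == cls:
                fn += 1
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Toy data (assumed for illustration only, not from the CNN Arabic corpus).
docs = [{"goal", "match"}, {"goal", "team"}, {"bank", "market"}, {"market", "trade"}]
labels = ["sport", "sport", "business", "business"]

# Rank candidate terms for the "sport" class by chi-square score.
vocab = sorted(set().union(*docs))
ranked = sorted(vocab, key=lambda t: chi_square(docs, labels, t, "sport"), reverse=True)
```

A feature selector in this style would keep only the top-ranked terms before training a classifier. For single-label classification, Micro-F1 as computed here coincides with accuracy, since every misclassified document counts once as a false positive and once as a false negative.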


Pages: 32664-32671

Page count: 8
