Author gender identification from Arabic text

Cited by: 28
Authors
Alsmearat, Kholoud [1]
Al-Ayyoub, Mahmoud [1]
Al-Shalabi, Riyad [2]
Kanaan, Ghassan [2]
Affiliations
[1] Jordan Univ Sci & Technol, Irbid, Jordan
[2] Amman Arab Univ, Amman, Jordan
Keywords
Arabic text processing; Gender identification; Stylometric features; Bag-Of-Words; ATTRIBUTION
DOI
10.1016/j.jisa.2017.06.003
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The Gender Identification (GI) problem is concerned with determining the gender of a given text's author. It has a wide range of academic and commercial applications in fields including literature, security, forensics, and electronic markets and trading. To address this problem, researchers have proposed that the writing styles of authors of the same gender share certain aspects, which can be captured by stylometric features (SF). Another approach focuses mainly on keyword occurrences in each document; this is known as the Bag-Of-Words (BOW) approach. In this work, we study and compare both approaches, focusing on the Arabic language, for which this problem is still largely understudied despite its importance. To the best of our knowledge, no previous work has considered these approaches for the GI problem of Arabic text. The comparison is carried out under different settings, and the results show that the SF approach, which is much cheaper to train, can generate more accurate results under most settings. In fact, the best accuracy levels obtained by the SF and BOW approaches on our in-house dataset are 80.4% and 73.9%, respectively. © 2017 Elsevier Ltd. All rights reserved.
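To make the contrast between the two approaches concrete, the following is a minimal Python sketch, not the authors' implementation. The function names `tokenize`, `bow_features`, and `stylometric_features`, the three style measurements, and the sample sentence are all illustrative assumptions; the abstract does not list the features or preprocessing actually used.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into word tokens; \\w matches Arabic letters in Python 3."""
    return re.findall(r"\w+", text)


def bow_features(text):
    """Bag-Of-Words: count how often each word occurs in the document."""
    return Counter(tokenize(text))


def stylometric_features(text):
    """A few illustrative style measurements (hypothetical feature set)."""
    words = tokenize(text)
    # Split on Latin and Arabic sentence-ending punctuation.
    sentences = [s for s in re.split(r"[.!?؟]+", text) if s.strip()]
    punctuation = sum(1 for ch in text if ch in ".,!?;:،؛؟")
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "punctuation_ratio": punctuation / len(text) if text else 0.0,
    }


# Made-up sample text, used only to exercise the two functions.
sample = "كتب الولد الدرس. حفظ الولد الدرس!"
bow = bow_features(sample)
style = stylometric_features(sample)
```

In this sketch the BOW vector grows with the vocabulary, while the stylometric vector has a small fixed size regardless of the text, which is one plausible reading of why the abstract describes the SF approach as much cheaper to train.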
Pages: 85-95
Number of pages: 11