Advancing Author Gender Identification in Modern Standard Arabic with Innovative Deep Learning and Textual Feature Techniques

Cited by: 0
Authors
Himdi, Hanen [1]
Shaalan, Khaled [2]
Affiliations
[1] Univ Jeddah, Coll Comp Sci & Engn, Comp Sci & Artificial Intelligence Dept, Jeddah 21955, Saudi Arabia
[2] British Univ Dubai, Fac Engn & IT, DIAC Block 11,POB 345015, Dubai, U Arab Emirates
Keywords
natural language processing (NLP); deep learning; text mining; BERT; textual analysis; transformer-based models; dialect
DOI
10.3390/info15120779
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Author Gender Identification (AGI) is an extensively studied subject owing to its significance in several domains, such as security and marketing. Recognizing an author's gender may help marketers segment consumers more effectively and craft content tailored to a gender's preferences. In cybersecurity, identifying an author's gender may aid in detecting phishing attempts in which attackers imitate individuals of a specific gender. Although studies in Arabic have mostly concentrated on written dialects, such as tweets, few address Modern Standard Arabic (MSA) in journalistic genres. To address the AGI problem, this work combines natural language processing with deep learning methods. First, we propose a large dataset of 8k MSA articles composed of various columns sourced from news platforms, each labeled with its author's gender. Second, we extract and analyze textual features that may help identify gender-related cues in authors' writing, focusing on semantics and syntax. Third, we evaluate several deep learning models, namely Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), and Bidirectional Encoder Representations from Transformers (BERT). Finally, we propose a novel enhanced BERT model that incorporates gender-specific textual features. Across various experiments, the results underscore the potential of both BERT and the textual features: the enhanced BERT model reaches 91% accuracy, and the deep learning models range from 80% to 90% accuracy. We also apply these features to AGI in informal, dialectal text, where the enhanced BERT model reaches 68.7% accuracy. This indicates that these gender-specific textual features are useful for AGI across MSA and dialectal texts.
Pages: 21