Assessing Urdu Language Processing Tools via Statistical and Outlier Detection Methods on Urdu Tweets

Cited by: 3
Authors
Zoya [1]
Latif, Seemab [1]
Latif, Rabia [2]
Majeed, Hammad [3]
Jamail, Nor Shahida Mohd [2]
Affiliations
[1] Natl Univ Sci & Technol NUST, SEECS, H-12, Islamabad 44000, Pakistan
[2] Prince Sultan Univ, CCIS, Riyadh 11586, Saudi Arabia
[3] FAST Natl Univ Comp & Emerging Sci, H-11-4, Islamabad 44000, Pakistan
Keywords
Urdu language processing; Urdu tweets; Urdu text tokenization; Urdu language processing tools; outlier detection and removal; SENTIMENT ANALYSIS; TESTS;
DOI
10.1145/3622939
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Text pre-processing is a crucial step in Natural Language Processing (NLP) applications, particularly for handling informal and noisy content on social media. Word-level tokenization plays a vital role in text pre-processing by removing stop words, filtering irrelevant characters, and retaining relevant tokens. These tokens are essential for constructing meaningful n-grams within advanced NLP frameworks used for data modeling. However, tokenization in low-resource languages like Urdu presents challenges due to language complexity and limited resources. Conventional space-based methods and the direct application of language-specific tools often produce erroneous tokens in Urdu Language Processing (ULP). This hinders language models from effectively learning language-specific and domain-specific tokens, leading to sub-optimal results in downstream tasks such as aspect mining, topic modeling, and Named Entity Recognition (NER). To address this issue for Urdu, we propose a data pre-processing technique that detects outliers using the Inter-Quartile Range (IQR) method, together with normalization algorithms for creating useful lexicons in conjunction with existing tools. We collected approximately 50 million Urdu tweets using the Twitter API and analyzed the performance of existing language-specific tokenizers (Urduhack and a space-based tokenizer). Dataset variants were created from these tokenizers, and we applied statistical tests and visualization techniques to compare tokenization results before and after the proposed outlier detection and normalization. Our findings highlight noticeable improvements in token size distributions and in the handling of informal, misspelled, and lengthy tokens. The Urduhack tokenizer combined with the proposed outlier detection and normalization yielded tokens with the best-fitted distribution in ULP.
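IQR-based outlier detection on token lengths can be sketched as follows. This is a minimal illustration only: the function names, the standard 1.5×IQR Tukey fences, and the use of character length as the outlier criterion are assumptions, since the paper's exact normalization algorithms are not reproduced here.

```python
import numpy as np

def iqr_outlier_bounds(token_lengths):
    """Tukey fences: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers."""
    q1, q3 = np.percentile(token_lengths, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def filter_tokens(tokens):
    """Keep only tokens whose character length falls inside the IQR fences."""
    lengths = [len(t) for t in tokens]
    lo, hi = iqr_outlier_bounds(lengths)
    return [t for t in tokens if lo <= len(t) <= hi]

# Typical short tokens plus one abnormally long (e.g. misspelled/concatenated) token
tokens = ["cat", "dog", "fish", "bird", "tree", "house", "x" * 40]
kept = filter_tokens(tokens)
# The 40-character noise token falls outside the upper fence and is removed
print(kept)
```

In practice the same rule would be applied to Urdu token-length distributions produced by each tokenizer before the normalization step.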
The technique's effectiveness has been evaluated through the task of topic modeling using Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA). The results demonstrated new and distinct topics using unigram features, and highly coherent topics when utilizing bigram features. For the traditional space-based method, the results consistently showed improved coherence and precision scores; however, NMF topic modeling with bigram features outperformed LDA topic modeling with bigram features.
Pages: 31