Machine learning based framework for fine-grained word segmentation and enhanced text normalization for low resourced language

Cited by: 1
Authors
Nazir, Shahzad [1]
Asif, Muhammad [1]
Rehman, Mariam [2]
Ahmad, Shahbaz [1]
Affiliations
[1] Natl Text Univ, Dept Comp Sci, Faisalabad, Pakistan
[2] Govt Coll Univ, Dept Informat Technol, Faisalabad, Faisalabad, Pakistan
Keywords
Word segmentation; Text normalization; Machine learning; Low resourced languages;
DOI
10.7717/peerj-cs.1704
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In text applications, pre-processing plays a significant role in improving the results of natural language processing (NLP) tasks. Text normalization and tokenization are two pivotal pre-processing steps. Text normalization transforms raw text into a standardized written form, while word tokenization splits the text into tokens or words. Well-defined normalization and tokenization approaches exist for most of the world's widely spoken languages. However, Urdu, the world's 10th most widely spoken language, has been overlooked by the research community. This research presents improved text normalization and tokenization techniques for the Urdu language. For Urdu text normalization, multiple regular expressions and rules are proposed, including removing diacritics, normalizing single characters, and separating digits, among others. For word tokenization, core features are defined and extracted for each character of the text. A machine learning model is combined with specified handcrafted rules to predict spaces and tokenize the text. The experiment is performed on a newly created dataset, the largest human-annotated dataset composed in Urdu script, covering five different domains. The results are evaluated using precision, recall, F-measure, and accuracy, and are compared with the state of the art. The normalization approach yielded a 20% improvement and the tokenization approach a 6% improvement.
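The abstract names three kinds of normalization rule (removing diacritics, normalizing single characters, separating digits) and a per-character feature representation for space prediction, but it does not list the actual rules or features. The Python sketch below is only an illustration of what such steps could look like. The diacritic range, the two-entry character map, the context-window features, and the function names `normalize` and `char_examples` are all assumptions made for this example and are not taken from the paper.

```python
import re

# Assumed diacritic set: Arabic-script combining marks U+064B..U+0652
# (fathatan through sukun) plus U+0670 (superscript alef).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

# Hypothetical single-character normalization: map Arabic code points to
# the code points normally used in Urdu text. The paper's table may differ.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Farsi yeh
    "\u0643": "\u06A9",  # Arabic kaf -> keheh
})

# Zero-width boundary between a digit and an adjacent non-digit, non-space
# character (either order). Note this also splits around "." in "12.5".
DIGIT_BOUNDARY = re.compile(r"(?<=[^\d\s])(?=\d)|(?<=\d)(?=[^\d\s])")


def normalize(text: str) -> str:
    """Apply the three illustrative rules, then collapse whitespace."""
    text = DIACRITICS.sub("", text)        # remove diacritics
    text = text.translate(CHAR_MAP)        # normalize single characters
    text = DIGIT_BOUNDARY.sub(" ", text)   # separate digits from letters
    return re.sub(r"\s+", " ", text).strip()


def char_examples(segmented: str, window: int = 2):
    """Turn space-segmented text into one (features, label) pair per character.

    The label is 1 when the character ends a word (a space, or the end of
    the text, follows it) and 0 otherwise. Features are the character and
    its neighbours within `window` positions in the unsegmented stream.
    """
    chars, labels = [], []
    for word in segmented.split():
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append(1 if i == len(word) - 1 else 0)

    examples = []
    for i, ch in enumerate(chars):
        feats = {"char": ch}
        for k in range(1, window + 1):
            feats[f"prev{k}"] = chars[i - k] if i - k >= 0 else "<s>"
            feats[f"next{k}"] = chars[i + k] if i + k < len(chars) else "</s>"
        examples.append((feats, labels[i]))
    return examples
```

The feature dictionaries produced by `char_examples` could be passed to any standard classifier that accepts categorical features. Which model and which handcrafted rules the authors actually used is not stated in this record.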
Pages: 19