A conditional random field based approach for high-accuracy part-of-speech tagging using language-independent features

Citations: 0
Authors
Ali, Mushtaq [1]
Khan, Muzammil [1]
Alharbi, Yasser [2]
Affiliations
[1] Univ Swat, Dept Comp & Software Technol, Swat, KP, Pakistan
[2] Univ Hail, Coll Comp Sci & Engn, Hail, Saudi Arabia
Keywords
Part of speech; Urdu corpus; Conditional random field; Natural language processing; Tagging; Classification
DOI
10.7717/peerj-cs.2577
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Part-of-speech (POS) tagging is the process of assigning a tag or label to each word of a text according to its grammatical category. It exposes the grammatical structure of a text and plays an important role in many natural language processing tasks, such as syntax understanding, semantic analysis, text processing, information retrieval, machine translation, and named entity recognition. Because POS tagging is sequential, context dependent, and labels every word, it is a sequence labeling task. The challenges of Urdu text processing, including resource scarcity, morphological richness, free word order, absence of capitalization, agglutinative nature, spelling variations, and multipurpose usage of words, raise the demand for automatic, machine-learning-based POS tagging systems for Urdu. Therefore, a conditional random field (CRF) based supervised POS classifier has been developed for 33 Urdu POS categories using language-independent features of Urdu text. The classifier targets the Urdu news dataset MM-POST, which contains 119,276 tokens from seven domains: Entertainment, Finance, General, Health, Politics, Science, and Sports. An analysis of the proposed approach is presented, which argues that it is superior to other Urdu POS tagging research because it uses a simpler strategy, employing fewer word-level features as context windows together with the word length. The effective use of these features for the POS tagging of Urdu text resulted in state-of-the-art performance of the CRF model, with an overall classification accuracy of 96.1%.
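The feature design summarized in the abstract (neighbouring words as a context window, plus the word length) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the window size of 2, the `<PAD>` placeholder, and the `bias` feature are all assumptions made for illustration. The output is one feature dictionary per token, a shape commonly accepted by CRF toolkits.

```python
from typing import Dict, List

# Hypothetical placeholder for context positions that fall outside the sentence.
PAD = "<PAD>"


def token_features(tokens: List[str], i: int, window: int = 2) -> Dict[str, object]:
    """Build language-independent features for tokens[i].

    Uses only the surface form of the word, its length in characters,
    and the surrounding words within `window` positions (assumed size).
    """
    feats: Dict[str, object] = {
        "bias": 1.0,
        "word": tokens[i],
        "length": len(tokens[i]),
    }
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # the current word is already stored under "word"
        j = i + offset
        # Keys look like "word[-1]" or "word[+2]"; pad beyond sentence edges.
        feats[f"word[{offset:+d}]"] = tokens[j] if 0 <= j < len(tokens) else PAD
    return feats


def sentence_features(tokens: List[str], window: int = 2) -> List[Dict[str, object]]:
    """Return one feature dictionary per token of a sentence."""
    return [token_features(tokens, i, window) for i in range(len(tokens))]


if __name__ == "__main__":
    # Illustrative Urdu sentence ("this is a book"), not taken from MM-POST.
    sentence = ["یہ", "ایک", "کتاب", "ہے"]
    for feats in sentence_features(sentence):
        print(feats)
```

Because no feature depends on Urdu-specific resources such as a lexicon or morphological analyzer, the same extraction applies unchanged to any whitespace-tokenized language, which is the sense in which such features are language independent.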
Pages: 31