Part-of-Speech (POS) Tagging Using Deep Learning-Based Approaches on the Designed Khasi POS Corpus

Cited by: 1

Authors:

Warjri, Sunita [1]

Pakray, Partha [2]

Lyngdoh, Saralin A. [3]

Maji, Arnab Kumar [1]

Affiliations:

[1] North Eastern Hill Univ, Dept Informat Technol, PO NEHU, Shillong 793022, Meghalaya, India

[2] Natl Inst Technol, Dept Comp Sci & Engn, Silchar 788010, Assam, India

[3] North Eastern Hill Univ, Dept Linguist, PO NEHU, Shillong 793022, Meghalaya, India

Source:

ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING | 2022 / Vol. 21 / Issue 03

Keywords:

Deep learning; BiLSTM; word embedding; POS tagger; ambiguity; Khasi language; Khasi corpus; AGREEMENT; LSTM

DOI:

10.1145/3488381

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Part-of-speech (POS) tagging is one of the challenging research fields in natural language processing (NLP). It requires good knowledge of a particular language, together with large amounts of data or corpora for feature engineering, which can lead to good tagger performance. Our main contribution in this research work is the designed Khasi POS corpus. To date, no Khasi corpus of any kind has been developed, formally or otherwise. In the designed Khasi POS corpus, each word is tagged manually using the designed tagset. Deep learning methods have been used to experiment with the corpus. POS taggers based on BiLSTM, a combination of BiLSTM with CRF, and character-based embedding with BiLSTM are presented. The main challenges encountered in understanding and handling natural language from a computational linguistics standpoint are anticipated. In the designed corpus, we have tried to resolve the ambiguities of words with respect to their contextual usage, as well as the orthography problems that arise in the POS corpus. The designed Khasi corpus contains around 96,100 tokens and 6,616 distinct words. Initially, when running the first sets of data of around 41,000 tokens in our experiment, the taggers were found to yield considerably accurate results. When the corpus size was increased to 96,100 tokens, we saw an increase in accuracy, and the analyses became more pertinent. As a result, accuracies of 96.81% for the BiLSTM method, 96.98% for BiLSTM with CRF, and 95.86% for the character-based model with LSTM are achieved. With regard to substantial NLP research on Khasi, we also present some of the recent existing POS taggers and other NLP work on the Khasi language for comparison.


Pages: 24

50 items in total

[1] Part-of-speech (POS) tagging using conditional random field (CRF) model for Khasi corpora
Sunita Warjri
Partha Pakray
Saralin A. Lyngdoh
Arnab Kumar Maji
International Journal of Speech Technology, 2021, 24 : 853 - 864
[2] Part-of-speech (POS) tagging using conditional random field (CRF) model for Khasi corpora
Warjri, Sunita
Pakray, Partha
Lyngdoh, Saralin A.
Maji, Arnab Kumar
INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2021, 24 (04) : 853 - 864
[3] A Deep Learning-Based Approach for Part of Speech (PoS) Tagging in the Pashto Language
Ullah, Shaheen
Ahmad, Riaz
Namoun, Abdallah
Muhammad, Siraj
Ullah, Khalil
Hussain, Ibrar
Ibrahim, Isa Ali
IEEE ACCESS, 2024, 12 : 86355 - 86364
[4] Part-of-Speech (POS) Tagging for the Nyishi Language
Siram, Joyir
Sambyo, Koj
Sarkar, Achyuth
ADVANCES IN INFORMATION COMMUNICATION TECHNOLOGY AND COMPUTING, AICTC 2021, 2022, 392 : 191 - 199
[5] A hybrid part-of-speech tagger with annotated Kurdish corpus: advancements in POS tagging
Maulud, Dastan
Jacksi, Karwan
Ali, Ismael
DIGITAL SCHOLARSHIP IN THE HUMANITIES, 2023, 38 (04) : 1604 - 1612
[6] Corpus based part-of-speech tagging
Lv, Chengyao
Liu, Huihua
Dong, Yuanxing
Chen, Yunliang
INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2016, 19 (03) : 647 - 654
[7] Part-of-Speech Tagging of Odia Language Using Statistical and Deep Learning Based Approaches
Dalai, Tusarkanta
Mishra, Tapas Kumar
Sa, Pankaj K.
ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (06)
[8] Part-of-Speech (POS) Tagging for Standard Brunei Malay: A Probabilistic and Neural-Based Approach
Mohaimin, Izzati
Apong, Rosyzie A.
Damit, Ashrol R.
JOURNAL OF ADVANCES IN INFORMATION TECHNOLOGY, 2023, 14 (04) : 830 - 837
[9] Colloquial Persian POS (CPPOS) Corpus: A Novel Corpus for Colloquial Persian Part of Speech Tagging
Rabiei, Leyla
Rahmani, Farzaneh
Khansari, Mohammad
Rajabi, Zeinab
Salimi, Moein
arXiv, 2023,
[10] Deep Learning Based Unsupervised POS Tagging for Sanskrit
Srivastava, Prakhar
Chauhan, Kushal
Aggarwal, Deepanshu
Shukla, Anupam
Dhar, Joydip
Jain, Vrashabh Prasad
2018 INTERNATIONAL CONFERENCE ON ALGORITHMS, COMPUTING AND ARTIFICIAL INTELLIGENCE (ACAI 2018), 2018,
