Part-of-Speech (POS) Tagging Using Deep Learning-Based Approaches on the Designed Khasi POS Corpus

Cited by: 1
Authors
Warjri, Sunita [1]
Pakray, Partha [2]
Lyngdoh, Saralin A. [3]
Maji, Arnab Kumar [1]
Affiliations
[1] North Eastern Hill Univ, Dept Informat Technol, PO NEHU, Shillong 793022, Meghalaya, India
[2] Natl Inst Technol, Dept Comp Sci & Engn, Silchar 788010, Assam, India
[3] North Eastern Hill Univ, Dept Linguist, PO NEHU, Shillong 793022, Meghalaya, India
Keywords
Deep learning; BiLSTM; word embedding; POS tagger; ambiguity; Khasi language; Khasi corpus; AGREEMENT; LSTM
DOI
10.1145/3488381
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Part-of-speech (POS) tagging is one of the challenging research fields in natural language processing (NLP). It requires good knowledge of a particular language and large amounts of data or corpora for feature engineering, both of which contribute to good tagger performance. Our main contribution in this work is the designed Khasi POS corpus. To date, no Khasi corpus of any kind has been formally developed. In the designed Khasi POS corpus, each word is tagged manually using the designed tagset. Deep learning methods are used to experiment with the corpus: we present POS taggers based on BiLSTM, on BiLSTM combined with CRF, and on character-based embedding with BiLSTM. The main challenges encountered in understanding and handling natural language from a computational linguistics perspective are anticipated. In the designed corpus, we have tried to resolve the ambiguities of words with respect to their contextual usage, as well as the orthography problems that arise in the corpus. The corpus contains around 96,100 tokens and 6,616 distinct words. In initial experiments on the first sets of data, around 41,000 tokens, the taggers yielded considerably accurate results. When the corpus was increased to 96,100 tokens, accuracy increased and the analyses became more pertinent. The resulting accuracy is 96.81% for the BiLSTM method, 96.98% for BiLSTM with CRF, and 95.86% for the character-based model with LSTM. To situate this work within NLP research on Khasi, we also present some recent POS taggers and other NLP works on the Khasi language for comparison.
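The corpus figures and accuracy rates quoted in the abstract are token-level quantities. As a minimal sketch of how such numbers are typically computed, and not the authors' evaluation code, the Python below counts tokens and distinct words in a tagged corpus and measures the share of tokens whose predicted tag matches the manual tag. The function names, the placeholder words (`w1`, `w2`, `w3`) and the placeholder tags (`A`, `B`, `C`) are assumptions made for illustration; they are not taken from the Khasi corpus or its tagset.

```python
def corpus_stats(tagged_sentences):
    """Return (token_count, distinct_word_count) for a tagged corpus.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    """
    words = [word for sentence in tagged_sentences for word, _tag in sentence]
    return len(words), len(set(words))


def token_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag equals the gold (manual) tag."""
    correct = 0
    total = 0
    for gold_sentence, predicted_sentence in zip(gold_tags, predicted_tags):
        if len(gold_sentence) != len(predicted_sentence):
            raise ValueError("gold and predicted sentences differ in length")
        for gold, predicted in zip(gold_sentence, predicted_sentence):
            correct += gold == predicted
            total += 1
    if total == 0:
        raise ValueError("no tokens to evaluate")
    return correct / total


# Hypothetical toy corpus: placeholder words and tags, not real Khasi data.
toy_corpus = [
    [("w1", "A"), ("w2", "B"), ("w1", "A")],
    [("w3", "C"), ("w2", "B")],
]
gold = [[tag for _word, tag in sentence] for sentence in toy_corpus]
predicted = [["A", "B", "B"], ["C", "B"]]  # one wrong tag out of five

tokens, distinct = corpus_stats(toy_corpus)
accuracy = token_accuracy(gold, predicted)
print(tokens, distinct, accuracy)
```

On the toy data this reports 5 tokens, 3 distinct words, and an accuracy of 0.8, since four of the five tags agree.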
Pages: 24
Related papers
50 in total
  • [1] Part-of-speech (POS) tagging using conditional random field (CRF) model for Khasi corpora
    Sunita Warjri
    Partha Pakray
    Saralin A. Lyngdoh
    Arnab Kumar Maji
    International Journal of Speech Technology, 2021, 24 : 853 - 864
  • [2] Part-of-speech (POS) tagging using conditional random field (CRF) model for Khasi corpora
    Warjri, Sunita
    Pakray, Partha
    Lyngdoh, Saralin A.
    Maji, Arnab Kumar
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2021, 24 (04) : 853 - 864
  • [3] A Deep Learning-Based Approach for Part of Speech (PoS) Tagging in the Pashto Language
    Ullah, Shaheen
    Ahmad, Riaz
    Namoun, Abdallah
    Muhammad, Siraj
    Ullah, Khalil
    Hussain, Ibrar
    Ibrahim, Isa Ali
    IEEE ACCESS, 2024, 12 : 86355 - 86364
  • [4] Part-of-Speech (POS) Tagging for the Nyishi Language
    Siram, Joyir
    Sambyo, Koj
    Sarkar, Achyuth
    ADVANCES IN INFORMATION COMMUNICATION TECHNOLOGY AND COMPUTING, AICTC 2021, 2022, 392 : 191 - 199
  • [5] A hybrid part-of-speech tagger with annotated Kurdish corpus: advancements in POS tagging
    Maulud, Dastan
    Jacksi, Karwan
    Ali, Ismael
    DIGITAL SCHOLARSHIP IN THE HUMANITIES, 2023, 38 (04) : 1604 - 1612
  • [6] Corpus based part-of-speech tagging
    Lv, Chengyao
    Liu, Huihua
    Dong, Yuanxing
    Chen, Yunliang
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2016, 19 (03) : 647 - 654
  • [7] Part-of-Speech Tagging of Odia Language Using Statistical and Deep Learning Based Approaches
    Dalai, Tusarkanta
    Mishra, Tapas Kumar
    Sa, Pankaj K.
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (06)
  • [8] Part-of-Speech (POS) Tagging for Standard Brunei Malay: A Probabilistic and Neural- Based Approach
    Mohaimin, Izzati
    Apong, Rosyzie A.
    Damit, Ashrol R.
    JOURNAL OF ADVANCES IN INFORMATION TECHNOLOGY, 2023, 14 (04) : 830 - 837
  • [9] Colloquial Persian POS (CPPOS) Corpus: A Novel Corpus for Colloquial Persian Part of Speech Tagging
    Rabiei, Leyla
    Rahmani, Farzaneh
    Khansari, Mohammad
    Rajabi, Zeinab
    Salimi, Moein
    arXiv, 2023
  • [10] Deep Learning Based Unsupervised POS Tagging for Sanskrit
    Srivastava, Prakhar
    Chauhan, Kushal
    Aggarwal, Deepanshu
    Shukla, Anupam
    Dhar, Joydip
    Jain, Vrashabh Prasad
    2018 INTERNATIONAL CONFERENCE ON ALGORITHMS, COMPUTING AND ARTIFICIAL INTELLIGENCE (ACAI 2018), 2018