Next word prediction based on the N-gram model for Kurdish Sorani and Kurmanji

Cited by: 10
Authors
Hamarashid, Hozan K. [1]
Saeed, Soran A. [2]
Rashid, Tarik A. [3]
Affiliations
[1] Sulaimani Polytech Univ, Comp Sci Inst, Dept Informat Technol, Sulaimani, Krg, Iraq
[2] Sulaimani Polytech Univ, Sulaimani, Krg, Iraq
[3] Univ Kurdistan Hewler, Sch Sci & Engn, Comp Sci & Engn Dept, Erbil, Krg, Iraq
Keywords
Next word prediction; Kurdish language; N-gram; Corpus
DOI
10.1007/s00521-020-05245-3
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Next word prediction is an input technology that simplifies typing by suggesting the next word for the user to select, since typing in a conversation is time-consuming. A few previous studies have focused on the Kurdish language, including the use of next word prediction. However, the lack of a Kurdish text corpus presents a challenge. Moreover, the lack of a sufficient number of N-grams for the Kurdish language, for instance five-grams, is the reason next word prediction is rarely used for Kurdish. Furthermore, several Kurdish letters are displayed improperly in the RStudio software, which is another problem. This paper provides a Kurdish corpus, creates five-grams, and presents a unique research work on next word prediction for Kurdish Sorani and Kurmanji. The N-gram model is used for next word prediction to reduce typing time in the Kurdish language. Because little work has been conducted on next word prediction for Kurdish, the N-gram model is utilized to suggest text accurately. R programming and RStudio are used to build the application. The model is 96.3% accurate.
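The abstract does not give implementation details, so the following is only a minimal sketch of how N-gram next word prediction with contexts of up to four preceding words (five-grams) might look. It is written in Python rather than the R used by the authors, and the function names `build_ngram_model` and `predict_next`, the back-off to shorter contexts, and the English toy corpus are all assumptions made for illustration, not the paper's method or data.

```python
from collections import Counter, defaultdict


def build_ngram_model(sentences, max_n=5):
    """Count which word follows each context of 0 to max_n - 1 words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            # k is the context length; k + 1 is the N-gram order.
            for k in range(max_n):
                if i - k < 0:
                    break
                context = tuple(tokens[i - k:i])
                model[context][word] += 1
    return model


def predict_next(model, typed_text, max_n=5, top_k=3):
    """Suggest up to top_k next words, backing off to shorter contexts."""
    tokens = typed_text.split()
    # Try the longest available context first, then progressively shorter ones.
    for k in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - k:])
        candidates = model.get(context)
        if candidates:
            return [word for word, _ in candidates.most_common(top_k)]
    return []


# Hypothetical toy corpus (English placeholder, not the paper's Kurdish corpus).
toy_corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat sat on the mat today",
]
toy_model = build_ngram_model(toy_corpus)
print(predict_next(toy_model, "sat on the"))  # ['mat', 'sofa']
```

When the typed context has never been seen, this sketch falls back to the most frequent single words; a production system would more likely use a smoothing scheme, which the abstract does not specify.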
Pages: 4547-4566
Page count: 20