Modified Lesk algorithm for word sense disambiguation in Bengali

Cited by: 0
Authors
Das, Ratul [1]
Pal, Alok Ranjan [2]
Saha, Diganta [1]
Affiliations
[1] Jadavpur Univ, Dept Comp Sci & Engn, Kolkata, West Bengal, India
[2] Coll Engn & Management, Dept Comp Sci & Engn, Kolaghat, West Bengal, India
Source
SADHANA-ACADEMY PROCEEDINGS IN ENGINEERING SCIENCES | 2024 / Vol. 49 / Issue 02
Keywords
Word sense disambiguation; Lesk algorithm; fastText; Indo WordNet; semantic similarity measure; word embeddings;
DOI
10.1007/s12046-024-02495-y
Chinese Library Classification
T [Industrial Technology];
Discipline classification code
08;
Abstract
This article presents a novel approach to Word Sense Disambiguation (WSD) for Bengali text. The algorithm is a modification of the Lesk algorithm. In the original algorithm, the overlap between the "context bag" and the "sense bag" items drawn from the lexical resource (WordNet) is calculated by word-pair matching. In the current approach, the overlap is instead calculated with a semantic similarity measure based on fastText subword embeddings. The approach can efficiently handle unknown word forms and discover the latent semantics of words. Significant progress has been made in WSD for English and other European languages, but Indian languages such as Bengali still pose a formidable challenge. The resources used in this work are individual sentences from the Bengali Wikipedia, a large collection of Bengali text (96 K webpages with 1700 K sentences), the Indo WordNet for Bengali, and the Bengali Online Dictionary. The experimental results are promising. Target words that have semantically distinct synsets in the WordNet give a high F1 score. The F1 score achieved is 80%, which is well above the baseline and a significant improvement over other knowledge-based approaches tried on low-resource Indian languages.
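The idea described in the abstract, replacing exact word overlap with an embedding-based similarity between the context bag and each sense bag, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the scoring rule (average best-match cosine), the English toy words, and above all `toy_vector`, which is a hashed character n-gram stand-in for trained fastText vectors, not the fastText model itself.

```python
import math
import zlib

DIM = 64  # toy dimensionality; an assumption, not a value from the paper


def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of '<word>', in the spirit of fastText subwords."""
    padded = f"<{word}>"
    grams = [padded]
    for n in range(n_min, n_max + 1):
        grams += [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams


def toy_vector(word):
    """Hypothetical stand-in for a trained embedding: hashed n-gram counts."""
    vec = [0.0] * DIM
    for gram in char_ngrams(word):
        vec[zlib.crc32(gram.encode("utf-8")) % DIM] += 1.0
    return vec


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bag_similarity(context_bag, sense_bag, embed=toy_vector):
    """Average, over context words, of the best cosine match in the sense bag.

    This scoring rule is an assumed example of a 'semantic overlap';
    the paper may define its measure differently.
    """
    if not context_bag or not sense_bag:
        return 0.0
    sense_vecs = [embed(w) for w in sense_bag]
    total = 0.0
    for word in context_bag:
        cv = embed(word)
        total += max(cosine(cv, sv) for sv in sense_vecs)
    return total / len(context_bag)


def disambiguate(context_bag, sense_bags, embed=toy_vector):
    """Return the sense label whose bag is most similar to the context bag."""
    return max(sense_bags,
               key=lambda s: bag_similarity(context_bag, sense_bags[s], embed))


# Toy usage with English words (hypothetical data, for readability only).
senses = {
    "bank_river": ["river", "water", "shore"],
    "bank_finance": ["money", "deposit", "loan"],
}
print(disambiguate(["river", "water", "shore"], senses))
```

Because the stand-in vectors are built from character n-grams, an unseen inflected form still shares most of its n-grams with its base form, which loosely mirrors why subword embeddings help with unknown word forms. In a real setting, `embed` would be replaced by a lookup into trained fastText vectors for Bengali, and the sense bags would be built from Indo WordNet glosses and examples.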
Pages: 12