A dataset for evaluating Bengali word sense disambiguation techniques

Cited by: 2
Authors
Das Dawn D. [1]
Khan A. [2]
Shaikh S.H. [3]
Pal R.K. [1]
Affiliations
[1] Department of Computer Science and Engineering, University of Calcutta, Calcutta
[2] Product Development and Diversification, ARP Engineering, Calcutta
[3] Department of Computer Science and Engineering, BML Munjal University, Kapriwas
Keywords
Bengali; Corpora; Dataset; Indo word dataset; Knowledge resources; Word sense disambiguation
DOI
10.1007/s12652-022-04471-y
Abstract
Natural language processing supports effective communication by retrieving the correct sense of each word. A word may be monosemous or polysemous. The use of polysemous words in an appropriate context plays a critical role in communication. Over the last two decades, a significant amount of research has addressed automatically identifying the correct sense of a polysemous word, a task known as word sense disambiguation. A word sense disambiguation algorithm identifies the proper sense of a polysemous word by analysing the contextual data. Nevertheless, there is a gap in the contemporary literature regarding the availability of datasets in Asian languages, especially Bengali. Therefore, in this work, we present a dataset comprising one hundred Bengali polysemous words. Each word in this dataset has three or four disjoint senses, and each sense is represented by ten paragraphs. Each paragraph describes the sense of a particular polysemous word. We have performed a statistical analysis on the basis of seven relevant characteristics. A general framework is also presented for training and testing, with possible guidelines for performance analysis. A baseline strategy is introduced based on four feature sets. Finally, a set of experiments has been performed to analyse the system performance. © 2022, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.
Pages: 4057-4086
Page count: 29