A dataset for evaluating Bengali word sense disambiguation techniques

Cited by: 2

Authors:

Das Dawn D. [1]

Khan A. [2]

Shaikh S.H. [3]

Pal R.K. [1]

Affiliations:

[1] Department of Computer Science and Engineering, University of Calcutta, Calcutta

[2] Product Development and Diversification, ARP Engineering, Calcutta

[3] Department of Computer Science and Engineering, BML Munjal University, Kapriwas

Source:

Journal of Ambient Intelligence and Humanized Computing | 2023 / Vol. 14 / Issue 04

Keywords:

Bengali; Corpora; Dataset; Indo word dataset; Knowledge resources; Word sense disambiguation

DOI:

10.1007/s12652-022-04471-y

Abstract:

The computation of natural language enables a suitable transmission through the universe by retrieving the correct sense of each word. A word may be monosemous or polysemous. The use of polysemous words in an appropriate context plays a critical role in communication. Over the last two decades, a significant amount of research has been done on automatically resolving the correct sense of a polysemous word in the context of word sense disambiguation. A word sense disambiguation algorithm identifies the proper sense of a polysemous word by analysing the contextual data. Nevertheless, there is a gap in the contemporary literature regarding the availability of datasets in Asian languages, especially Bengali. Therefore, in this work, we have presented a dataset comprising one hundred Bengali polysemous words. Each word in this dataset consists of three or four disjoint senses, and each sense comprises ten paragraphs. Each paragraph describes the sense of a particular polysemous word. We have performed statistical analysis on the basis of seven relevant and important characteristics. A general framework has also been presented for training and testing with possible guidelines for performance analysis. A baseline strategy has been introduced based on four feature sets. Finally, a set of experiments has been performed to analyse the system performance. © 2022, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.

Pages: 4057-4086

Page count: 29
