A novel hybrid methodology for computing semantic similarity between sentences through various word senses

Cited by: 0
Authors
Ahmad F. [1, 2]
Faisal D.M. [1, 2]
Affiliations
[1] Department of Computer Application, Integral University, Lucknow
[2] Department of Computer Application, Integral University, Lucknow
Source
International Journal of Cognitive Computing in Engineering | 2022 / Vol. 3
Keywords
Corpus; Lexical database; Natural language processing; Semantic search; Semantic similarity; Word embedding; Word overlap; WordNet
DOI
10.1016/j.ijcce.2022.02.001
Abstract
Measuring sentence similarity is an essential problem in natural language processing. Searching for semantic meaning in natural language is a related issue. The task of measuring sentence similarity is to find semantic symmetry in two sentences, no matter how they are arranged, and it is important to measure this similarity accurately. Existing methods for computing the similarity between sentences have been adapted from approaches designed for large texts. Since these methods work in very high-dimensional spaces, they are inefficient, require human input, and are not flexible enough for some applications. In this study, we propose a hybrid method (HydMethod) which considers not only semantic information, including lexical databases, word embeddings, and corpus statistics, but also implied word order information. With lexical databases, our method models human common sense knowledge, and that knowledge can then be adapted to different domains by incorporating corpus statistics. The methodology is therefore applicable across several domains. In our experiments, we used two standard datasets, the Pilot Short Text Semantic Similarity Benchmark and MS paraphrase, to demonstrate the efficacy of the proposed method. The proposed method outperforms the existing approaches when tested on these two datasets, giving the highest correlation value for both word and sentence similarity. Moreover, it achieves an improvement of up to 32% over methods that use only word vectors or only WordNet. On the Rubenstein and Goodenough word and sentence pairs, our algorithm's similarity measure shows a high Pearson correlation coefficient of 0.8953. © 2022
Pages: 58-77
Number of pages: 19