Knowledge-based sentence semantic similarity: algebraical properties

Cited by: 7
Authors
Oussalah, Mourad [1]
Mohamed, Muhidin [2]
Affiliations
[1] Univ Oulu, CMVS, Fac Informat Technol & Elect Engn, Oulu 90014, Finland
[2] Aston Univ, Operat & Informat Management Dept, Birmingham, W Midlands, England
Keywords
Sentence semantic similarity; Part-of-speech conversion; WordNet; CatVar
DOI
10.1007/s13748-021-00248-0
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Determining the extent to which two text snippets are semantically equivalent is a well-researched topic in natural language processing, information retrieval and text summarization. Sentence-to-sentence similarity scoring is used extensively in both generic and query-based document summarization as a significance or similarity indicator. Nevertheless, most of these applications use the semantic similarity measure only as a tool, without paying attention to the inherent properties of such measures, which ultimately restrict the scope and technical soundness of the underlying applications. This paper aims to help fill this gap. It investigates three popular WordNet hierarchical semantic similarity measures, namely path-length, Wu and Palmer, and Leacock and Chodorow, in terms of both algebraical and intuitive properties, highlighting their inherent limitations and theoretical constraints. In particular, we examine properties related to the range and scope of the semantic similarity score, incremental monotonicity evolution, monotonicity with respect to the hyponymy/hypernymy relationship, and a set of interactive properties. The extension from word semantic similarity to sentence similarity is also investigated using a pairwise canonical extension, and the properties of the resulting sentence-to-sentence similarity are examined and scrutinized. Next, to overcome the inherent limitations of WordNet semantic similarity in accounting for the various part-of-speech word categories, a WordNet "All word-To-Noun conversion" that makes use of the Categorial Variation Database (CatVar) is put forward and evaluated on a publicly available dataset, in comparison with some state-of-the-art methods. The findings demonstrate the feasibility of the proposal and open up new opportunities in information retrieval and natural language processing tasks.
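As an illustration of the three hierarchical measures named in the abstract, the following is a minimal sketch on a toy is-a taxonomy. The taxonomy, the function names and the sentence-level aggregation (average of best matches in both directions) are assumptions made for illustration, not the paper's data or its exact definitions. The word-level formulas follow conventions commonly used in WordNet toolkits: path similarity as 1 / (1 + shortest path in edges), Wu and Palmer as 2 * depth(LCS) / (depth(c1) + depth(c2)), and Leacock and Chodorow as -log(path in nodes / (2 * maximum depth)).

```python
import math

# Hypothetical toy is-a taxonomy (child -> parent); not actual WordNet data.
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}

def ancestors(c):
    """Chain from concept c up to the root, with c first."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain

def depth(c):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(c))

MAX_DEPTH = max(depth(c) for c in PARENT)

def lcs(c1, c2):
    """Lowest common subsumer of two concepts."""
    seen = set(ancestors(c1))
    return next(a for a in ancestors(c2) if a in seen)

def path_edges(c1, c2):
    """Shortest path between c1 and c2 through their LCS, in edges."""
    return depth(c1) + depth(c2) - 2 * depth(lcs(c1, c2))

def sim_path(c1, c2):
    return 1.0 / (1 + path_edges(c1, c2))

def sim_wup(c1, c2):
    return 2.0 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))

def sim_lch(c1, c2):
    # Path length is counted in nodes here (edges + 1).
    return -math.log((path_edges(c1, c2) + 1) / (2.0 * MAX_DEPTH))

def sentence_sim(s1, s2, word_sim):
    """Assumed pairwise extension: mean best match, averaged over both directions."""
    def directed(a, b):
        return sum(max(word_sim(w, x) for x in b) for w in a) / len(a)
    return 0.5 * (directed(s1, s2) + directed(s2, s1))

if __name__ == "__main__":
    for name, f in [("path", sim_path), ("wup", sim_wup), ("lch", sim_lch)]:
        print(name, round(f("dog", "cat"), 3), round(f("dog", "dog"), 3))
    print(round(sentence_sim(["dog", "car"], ["cat", "car"], sim_path), 3))
```

On this toy hierarchy the path and Wu and Palmer scores of a concept with itself equal 1, whereas the Leacock and Chodorow score of a concept with itself is log(2 * maximum depth), which exceeds 1. This is one concrete instance of why the range of each score is worth checking before the scores are compared or combined.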
Pages: 43-63
Page count: 21